An enjoyed kernel apprentice
Just another WordPress weblog

February 27, 2017

RAID1: avoid unnecessary spin locks in I/O barrier code

Filed under: Block Layer Magic,kernel — colyli @ 11:56 am

When I ran parallel read performance testing on a md raid1 device built with two NVMe SSDs, I observed surprisingly bad throughput: with fio at 64KB block size, 40 sequential read I/O jobs and 128 iodepth, overall throughput was only 2.7GB/s, which is around 50% of the ideal performance number.

perf reported that the lock contention happens in allow_barrier() and wait_barrier(),

|        – 41.41%  fio [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
|             – _raw_spin_lock_irqsave
|                         + 89.92% allow_barrier
|                         + 9.34% __wake_up
|        – 37.30%  fio [kernel.kallsyms]  [k] _raw_spin_lock_irq
|              – _raw_spin_lock_irq
|                         – 100.00% wait_barrier

The reason is that these I/O barrier related functions,

– raise_barrier()
– lower_barrier()
– wait_barrier()
– allow_barrier()

always hold conf->resync_lock first, even when there are only regular read I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, which holds conf->resync_lock only when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provided comments to improve it. I continued to work on it and brought the patch into its current form.

In the new, simpler raid1 I/O barrier implementation, there are two wait-barrier functions,

  • wait_barrier()

It calls _wait_barrier() and is used for regular write I/O. If there is resync I/O happening on the same I/O barrier bucket, or the whole array is frozen, the task will wait until there is no barrier on the same bucket and the whole array is unfrozen.

  • wait_read_barrier()

Since regular read I/O won’t interfere with resync I/O (read_balance() makes sure only up-to-date data is read out), it is unnecessary to wait for the barrier for regular read I/Os; waiting is only necessary when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and conf->barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to avoid unnecessary spin locks in these functions. Once conf->nr_pending[idx] is increased, a resync I/O with the same barrier bucket index has to wait in raise_barrier(). Then in _wait_barrier(), if no barrier is raised on the same barrier bucket index and the array is not frozen, the regular I/O doesn’t need to hold conf->resync_lock; it can just increase conf->nr_pending[idx] and return to its caller. wait_read_barrier() is very similar to _wait_barrier(); the only difference is that it only waits when the array is frozen. For heavy parallel read I/Os, the lockless I/O barrier code gets rid of almost all the spin lock cost.
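As a concrete illustration, the ordering above (increase nr_pending[idx] first, then re-check the barrier and frozen state, and only fall back to the lock on conflict) can be sketched with C11 atomics in userspace. This is a simplified model with illustrative names, not the kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BARRIER_BUCKETS_NR 1024

/* Simplified userspace model of the lockless fast path; the field names
 * mirror struct r1conf, but this is not the kernel implementation. */
struct conf {
	atomic_int nr_pending[BARRIER_BUCKETS_NR];
	atomic_int barrier[BARRIER_BUCKETS_NR];
	atomic_bool array_frozen;
};

/* Returns true if a regular I/O may proceed without taking any lock. */
static bool wait_barrier_fast_path(struct conf *conf, int idx)
{
	/* Optimistically count ourselves as a pending I/O first ... */
	atomic_fetch_add(&conf->nr_pending[idx], 1);

	/* ... then re-check: if no resync barrier is raised on this bucket
	 * and the array is not frozen, we are done -- no spin lock taken. */
	if (!atomic_load(&conf->barrier[idx]) &&
	    !atomic_load(&conf->array_frozen))
		return true;

	/* Conflict: undo the increment and let the caller fall back to the
	 * slow path, which waits under conf->resync_lock. */
	atomic_fetch_sub(&conf->nr_pending[idx], 1);
	return false;
}
```

Because the increment happens before the check, a resync thread observing nr_pending[idx] > 0 knows a regular I/O is in flight on that bucket and must wait; this is what makes the fast path safe without the spin lock.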

This patch significantly improves raid1 read performance. In my testing, on a raid1 device built with two NVMe SSDs, running fio with 64KB block size, 40 sequential read I/O jobs and 128 iodepth, overall throughput increases from 2.7GB/s to 4.6GB/s (+70%).


Thanks to Shaohua and Neil, who very patiently explained memory barriers and atomic operations to me, and helped me compose this patch in a correct way. This patch is merged into Linux v4.11 with commit ID 824e47daddbf.

RAID1: a new I/O barrier implementation to remove resync window

Filed under: Block Layer Magic,kernel — colyli @ 11:42 am

Commit 79ef3a8aa1cb (“raid1: Rewrite the implementation of iobarrier.”) introduces a sliding resync window for the raid1 I/O barrier. This idea limits I/O barriers to happen only inside the sliding resync window; regular I/Os outside the resync window don’t need to wait for the barrier any more. On a large raid1 device, it helps a lot to improve parallel write I/O throughput when there are background resync I/Os performing at the same time.

The idea of the sliding resync window is awesome, but code complexity is a challenge. The sliding resync window requires several variables to work collectively; this is complex and very hard to make work correctly. Just grep “Fixes: 79ef3a8aa1” in the kernel git log: there are 8 more patches fixing the original resync window patch. And this is not the end; any further related modification may easily introduce more regressions.

Therefore I decided to implement a much simpler raid1 I/O barrier; by removing the resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,

  • Do not maintain a global unique resync window
  • Use multiple hash buckets to reduce I/O barrier conflicts: regular I/O only has to wait for a resync I/O when both have the same barrier bucket index, and vice versa.
  • I/O barrier conflicts can be reduced to an acceptable number if there are enough barrier buckets

Here I explain how the barrier buckets are designed,


The whole LBA address space of a raid1 device is divided into multiple barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.

Bio requests won’t cross the boundary of a barrier unit, which means the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O, 64MB is large enough for both read and write requests; for sequential I/O, considering that the underlying block layer may merge requests into larger ones, 64MB is still good enough.

Neil Brown also points out that for a resync operation, “we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame”. For full speed resync, 64MB should take less than 1 second. When resync is competing with other I/O, it could take up to a few minutes. Therefore 64MB is a fairly good range for resync.


There are BARRIER_BUCKETS_NR buckets in total.
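The buckets and the barrier unit are sized by a handful of macros. The sketch below reproduces them as I understand the merged patch, with PAGE_SHIFT hard-coded to 12 for the 4KB page case discussed below:

```c
#define PAGE_SHIFT                12  /* 4KB pages assumed here */

#define BARRIER_UNIT_SECTOR_BITS  17
#define BARRIER_UNIT_SECTOR_SIZE  (1 << BARRIER_UNIT_SECTOR_BITS)  /* in sectors */
#define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
#define BARRIER_BUCKETS_NR        (1 << BARRIER_BUCKETS_NR_BITS)   /* 1024 */
```

A barrier unit is 1<<17 sectors, i.e. (1<<17)<<9 bytes = 64MB, matching the numbers used throughout this post.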


This patch changes the following members of struct r1conf from integers to arrays of integers,

- int     nr_pending;
- int     nr_waiting;
- int     nr_queued;
- int     barrier;
+ int   *nr_pending;
+ int   *nr_waiting;
+ int   *nr_queued;
+ int   *barrier;

The number of array elements is defined as BARRIER_BUCKETS_NR. For a 4KB kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O barrier buckets, and each array of integers occupies a single memory page. With 1024 buckets, a request smaller than the I/O barrier unit size has a ~0.1% chance of waiting for resync to pause, which is a small enough fraction. Also, requesting a single memory page is more friendly to the kernel page allocator than a larger memory size.

  • I/O barrier bucket is indexed by bio start sector

If multiple I/O requests hit different I/O barrier units, they only need to compete for the I/O barrier with other I/Os which hit the same barrier bucket index. The index of the barrier bucket which a bio should look for is calculated by sector_to_idx(), which is defined in raid1.h as an inline function,

+    static inline int sector_to_idx(sector_t sector)
+   {
+            return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
+                                             BARRIER_BUCKETS_NR_BITS);
+   }

Here sector is the start sector number of a bio.

  • Single bio won’t go across the boundary of an I/O barrier unit

If a request goes across the boundary of a barrier unit, it will be split. A bio may be split in raid1_make_request() or raid1_sync_request(), if the number of sectors returned by align_to_barrier_unit_end() is smaller than the original bio size.
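The split length can be sketched in userspace as follows; it mirrors what align_to_barrier_unit_end() computes (how many sectors a bio starting at start_sector may carry before hitting a barrier-unit boundary), with types and rounding helpers simplified compared with the kernel code:

```c
#define BARRIER_UNIT_SECTOR_SIZE (1 << 17)  /* 64MB expressed in 512-byte sectors */

typedef unsigned long long sector_t;

/* Number of sectors the bio may keep without crossing the boundary of
 * the barrier unit that contains start_sector. */
static sector_t align_to_barrier_unit_end(sector_t start_sector, sector_t sectors)
{
	/* First sector of the next barrier unit. */
	sector_t unit_end = (start_sector / BARRIER_UNIT_SECTOR_SIZE + 1)
			    * BARRIER_UNIT_SECTOR_SIZE;
	sector_t len = unit_end - start_sector;

	return len < sectors ? len : sectors;
}
```

If the returned length is smaller than the bio’s size, the bio is split at that point, so no single bio ever spans two barrier units.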

Compared with the single sliding resync window,

  • Currently resync I/O grows linearly, therefore regular and resync I/O will conflict within a single barrier unit. So the I/O behavior is similar to the single sliding resync window.
  • But a barrier bucket is shared by all barrier units with an identical barrier bucket index, so the probability of conflict might be higher than with the single sliding resync window, in the case that write I/Os always hit barrier units which have identical barrier bucket indexes to the resync I/Os. This is a very rare condition in real I/O workloads; I cannot imagine how it could happen in practice.
  • Therefore we can achieve a good enough low conflict rate with much simpler barrier algorithm and implementation.


Great thanks to Shaohua and Neil, who reviewed the code, pointed out many bugs, and provided very useful suggestions. Finally we made it: this patch is merged in Linux v4.11 with commit ID fd76863e37fe.

October 13, 2016

Why 4KB I/O requests are not merged on DM target

Filed under: File System Magic,kernel — colyli @ 10:00 am

(This article is for SLE11-SP3, which is based on Linux 3.0 kernel.)

Recently people reported that on SLE11-SP3, they observe that I/O requests are not merged on a device mapper target, and ‘iostat’ displays an average request size of only 4KB.

This is not a bug, and there is no negative performance impact. Here I try to explain why this situation is not a bug and how it happens; a few Linux file system and block layer internals will be mentioned, but it won’t be hard to understand.

The story is: on a SLE11-SP3 machine, an LVM volume is created (as a linear device mapper target), and ‘dd’ is used to generate sequential WRITE I/Os onto this volume. People tried buffered I/O and direct I/O with the ‘dd’ command, on the raw device mapper target, and on an ext3 file system on top of the device mapper target. So there are 4 conditions,

1) buffered I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M
2) direct I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M oflag=direct
3) buffered I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M
4) direct I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M oflag=direct

For 2) and 4), large request sizes are observed, from hundreds to thousands of sectors, with a maximum request size of 2048 sectors (because bs=1M). But for 1) and 3), all the request sizes displayed by ‘iostat’ on device mapper target dm-0 are 8 sectors (4KB).

The question is: sequential write I/Os are supposed to be merged into larger ones, so why is the request size reported by ‘iostat’ on device mapper target /dev/dm-0 only 4KB, not merged into a larger request size?

At first, let me give the simple answer: a) this device mapper target does not merge small bios into large ones, and b) upper layer code only issues 4KB bios to the device mapper layer.

Let me explain the above 2 points in detail. For direct I/O, the request size on the device mapper target is the actual size sent from the upper layer; it might come directly from the application buffer, or be adjusted by the file system. So we only look at the buffered I/O cases.

a) device mapper target does not merge bios
Device mapper only handles bios. In the case of a linear device mapper target (a common and simple lvm volume), it only re-maps the original bio from the logical device mapper target to the actual underlying storage device, or maybe splits the bio into smaller ones if the original bio goes across multiple underlying storage devices. It never combines small bios into larger ones; it just re-maps the bios and submits them to the underlying block layer. The elevator, a.k.a. the I/O scheduler, handles request merging and scheduling; device mapper does not.
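For illustration, the per-bio work of a linear target boils down to sector arithmetic like the sketch below (illustrative names, not the dm-linear API); nothing here ever combines two bios:

```c
typedef unsigned long long sector_t;

/* Userspace model of a linear dm target's mapping parameters. */
struct linear_target {
	sector_t target_start;  /* first sector of this target in the dm device */
	sector_t dest_start;    /* corresponding sector on the underlying device */
};

/* Re-map a bio's start sector from the dm device's address space into the
 * underlying device's address space; the bio's size is left untouched. */
static sector_t linear_remap(const struct linear_target *t, sector_t bio_sector)
{
	return t->dest_start + (bio_sector - t->target_start);
}
```

Since only the start sector changes, whatever size the upper layer submitted (here, 4KB) is exactly what the underlying device sees.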

b) upper layer code issues only 4KB size bios
For buffered I/O, the file system only dirties the pages which contain the data being written to disk; the actual write-out is handled automatically by the writeback and journaling code,
– journal: ext3 uses jbd to handle journaling. In data=ordered mode, jbd only handles metadata blocks, and submits the metadata I/Os in buffer heads, which means the maximum size is one page (4KB).
– write back: the actual kernel code that submits I/O to disk in the writeback path is mm/page-writeback.c:do_writepages(). In SLE11-SP3 it looks like this,

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        int ret;

        if (wbc->nr_to_write <= 0)
                return 0;
        if (mapping->a_ops->writepages)
                ret = mapping->a_ops->writepages(mapping, wbc);
        else
                ret = generic_writepages(mapping, wbc);
        return ret;
}

The device mapper target node is created on devtmpfs, which does not have a writepages() method defined. Ext3 does not define writepages() in its a_ops set either, so both conditions go into generic_writepages().

Inside generic_writepages(), the I/O code path is: generic_writepages()==>write_cache_pages()==>__writepage()==>mapping->a_ops->writepage(). For the different conditions, the implementations of mapping->a_ops->writepage() are different.

b.1) raw device mapper target
In SLE11-SP3, the block device mapping->a_ops->writepage() is defined in fs/block_dev.c:blkdev_writepage(); its code path to submit I/O is: blkdev_writepage()==>block_write_full_page()==>block_write_full_page_endio()==>__block_write_full_page(). In __block_write_full_page(), finally a buffer head containing this page is submitted to the underlying block layer by submit_bh(). So the device mapper layer only receives bios of 4KB size in this case.

b.2) ext3 file system on top of raw device mapper target
In SLE11-SP3, the mapping->a_ops->writepage() method in the ext3 file system is defined in three ways, corresponding to three different data journal modes. Here I use data=ordered mode as the example. Ext3 uses jbd as its journaling infrastructure; when the journal works in data=ordered mode (the default mode in SLE11-SP3), mapping->a_ops->writepage() is defined as fs/ext3/inode.c:ext3_ordered_writepage(). Inside this function, block_write_full_page() is called to write the page to the block layer; as in the raw device mapper target condition, finally submit_bh() is called to submit a bio with one page to the device mapper layer. Therefore in this case, the device mapper target still only receives bios of 4KB size.

Finally, let’s go back to my first simple answer: a) this device mapper target does not merge small bios into large ones, and b) upper layer code only issues 4KB bios to the device mapper layer.


June 17, 2016

My DCTC2016 talk: Linux MD RAID performance improvement since 3.11 to 4.6

Filed under: Great Days,kernel — colyli @ 3:11 am

This week I was invited by Memblaze to give a talk at Data Center Technology Conference 2016 about Linux MD RAID performance on NVMe SSDs. In the past 3 years, the Linux community has made a lot of effort to improve MD RAID performance on high speed media, especially on RAID456. I happen to maintain the block layer for SUSE Linux, and have back-ported quite a lot of these patches to Linux 3.12.

In this talk, I list a selection of recognized efforts from the Linux kernel community on MD RAID5 performance improvement, and how much the performance numbers increase with each patch (set); it looks quite impressive. Many people contributed their talent to this job, and I am glad to say “Thank you all”!


Slides of this talk in Mandarin can be found here; currently I don’t have time to translate them into English, maybe several months later …

April 26, 2016

libelf-devel is required when building kernel module

Filed under: Basic Knowledge,kernel — colyli @ 10:19 pm

Most documents about kernel module building just need a Makefile like this,

HELLO = helloworld

obj-m += $(HELLO).o

$(HELLO)-objs := hello.o world.o

KERNEL_SOURCE := /lib/modules/`uname -r`/build/

default:
	$(MAKE) -C $(KERNEL_SOURCE) M=`pwd` modules

clean:
	$(MAKE) -C $(KERNEL_SOURCE) M=`pwd` clean
	$(RM) Module.markers modules.order
Then typing “make” will get everything built. (Of course, the source of the current kernel must be ready at /lib/modules/`uname -r`/build.)

But yesterday, when I tried to build a kernel module for the Linux-4.6-rc5 (openSUSE vanilla) kernel, I observed an error I had never seen before.

helloworld> make

make -C /lib/modules/`uname -r`/build/ M=`pwd` modules

make[1]: Entering directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

make[2]: *** No rule to make target ‘/home/colyli/source/tmp/helloworld/hello.o’, needed by ‘/home/colyli/source/tmp/helloworld/helloworld.o’.  Stop.

Makefile:1428: recipe for target ‘_module_/home/colyli/source/tmp/helloworld’ failed

make[1]: *** [_module_/home/colyli/source/tmp/helloworld] Error 2

make[1]: Leaving directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

Makefile:10: recipe for target ‘default’ failed

make: *** [default] Error 2

It seemed nothing was missing, but the error message was there. Today a friend (Chang Liu from Memblaze) told me I should check the output of “make modules_prepare” in the kernel source directory. Here is my result,

linux-4.6-rc5-vanilla> make modules_prepare

Makefile:1016: “Cannot use CONFIG_STACK_VALIDATION, please install libelf-dev or elfutils-libelf-devel”

This is an informative clue, so I installed the libelf-devel package and re-ran “make modules_prepare”,

linux-4.6-rc5-vanilla> make modules_prepare

  CHK     include/config/kernel.release

  CHK     include/generated/uapi/linux/version.h

  CHK     include/generated/utsrelease.h

  CHK     include/generated/bounds.h

  CHK     include/generated/timeconst.h

  CHK     include/generated/asm-offsets.h

  CALL    scripts/

  DESCEND  objtool

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/fixdep.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/fixdep-in.o

  LINK     /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/fixdep

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/exec-cmd.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/help.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/pager.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/parse-options.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/run-command.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/sigchain.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/subcmd-config.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/libsubcmd-in.o

  AR       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/libsubcmd.a

  GEN      /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/arch/x86/insn/inat-tables.c

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/arch/x86/decode.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/arch/x86/objtool-in.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/builtin-check.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/elf.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/special.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/objtool.o

  CC       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/libstring.o

  LD       /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/objtool-in.o

  LINK     /home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla/tools/objtool/objtool

No complaints anymore, so I went back to the kernel module source directory and ran “make” again,

helloworld> make

make -C /lib/modules/`uname -r`/build/ M=`pwd` modules

make[1]: Entering directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

  CC [M]  /home/colyli/source/tmp/helloworld/hello.o

  CC [M]  /home/colyli/source/tmp/helloworld/world.o

  LD [M]  /home/colyli/source/tmp/helloworld/helloworld.o

  Building modules, stage 2.

  MODPOST 1 modules

  CC      /home/colyli/source/tmp/helloworld/helloworld.mod.o

  LD [M]  /home/colyli/source/tmp/helloworld/helloworld.ko

make[1]: Leaving directory ‘/home/colyli/source/suse-kernel/patched/linux-4.6-rc5-vanilla’

Everything is cool, the kernel module is built. So installing the libelf-devel package solved the problem.

But it is still not clear to me why a missing libelf-devel may cause kernel module building to fail. If you know the reason, please give me a hint. Thanks in advance.

March 13, 2014

A six lectures OS kernel development course

Filed under: Basic Knowledge,kernel — colyli @ 12:17 pm

Since year 2010, when I joined Taobao (a subsidiary of Alibaba Group), I have been helping my employer build a Linux kernel team, to maintain an in-house Linux kernel and optimize system performance continuously. The team grew from 1 person to 10 persons in the following 2 years; we made some successful stories in internal projects, while having 200+ patches merged into the upstream Linux kernel.

In these 2 years, I found most programmers had only a little idea of how to write code that cooperates with the Linux kernel perfectly. And I was not the only person with a similar conclusion. A colleague of mine, Zhitong Wang, a system software engineer from Ali Cloud (another subsidiary company of Alibaba Group), asked me whether I was interested in designing and promoting a course on OS kernel development, to help junior developers write better code on Linux servers. We had more than 100K real hardware servers online; if we could help other developers improve the performance of their code by 1%, no doubt it would be extremely cool.

Very soon, we agreed on the outline of the course. It was a six-lecture course, each lecture taking 120 ~ 150 minutes,


  • First class: Loading Kernel

This class introduced how a runnable OS kernel was loaded by boot loader and how the first instruction of the kernel was executed.

  • Second class: Protected Mode Programming

This class introduced the very basic concepts of x86 protected mode programming, which were fundamental to the remaining four classes.

  • Third class: System Call

This class explained how to design and implement the system call interface, and how privilege transfer happened.

  • Fourth class: Process scheduling

We expected people to be able to understand how the simplest scheduler worked and how context switches were made.

  • Fifth class: Physical Memory Management

In this class people could get a basic idea of how memory size was detected, how memory was managed before the buddy system was initialized, and how the buddy and slab systems worked.

  • Sixth class: Virtual Memory Management

Finally there was enough background knowledge to introduce how memory mapping, virtual memory areas and page faults were designed and implemented; there were also a few slides introducing the TLB and huge pages.


In the next 6 months, Zhitong and I finished the first version of all slides. When the Alibaba training department learned we were preparing an OS kernel development training, they helped us arrange time slots both in Beijing and Hangzhou (Alibaba Group office locations). We did the first wave of training in 4 months, with around 30 persons attending each class. We received a lot of positive feedback, beyond our expectations. Many colleagues told me they were too busy to attend all six classes, and asked us to arrange the course again.

This was great encouragement to us. We knew the training material could be better; we still had better methods to make the audience understand kernel development more. With this motivation, and many helpful suggestions from Zhitong, I spent half a year re-writing all slides for all six classes, to make the materials more logical, consistent and easy to follow.

Thanks to my employer, I could prepare the course material in working hours, and accomplish the second wave of training earlier. In the last two classes, the teaching room was full, and some people even had to stand for hours. Again, many colleagues complained they were too busy and missed some of the classes, and asked me to arrange another wave sometime in the future.

This is not an easy task: I gave 6 classes both in Beijing and Hangzhou, which including Q&A was more than 30 hours. But I have decided to arrange another wave of the course, maybe starting in Oct 2014, to show my gratitude to all the people who helped and encouraged me 🙂

Here you may find all slide files for these six classes; they are written in simplified Chinese.
[There is more than enough documentation in English, but in Chinese, the more the better]

* Class 1: osdev1-loading_kernel
* Class 2: osdev2-protected_mode_programming
* Class 3: osdev3-system_call
* Class 4: osdev4-process_scheduling
* Class 5: osdev5-physical_memory_management
* Class 6: osdev6-virtual_memory_management

August 23, 2013

openSuSE Conference 2013 in Thessaloniki, Greece

Filed under: Great Days — colyli @ 6:18 am


In recent months, I worked on hard disk I/O latency measurement for our cloud service infrastructure. The initial motivation was to identify almost-broken-but-still-working hard disks, and isolate them from online services. In order to avoid modifying core kernel data structures and execution paths, I hacked the device mapper module to measure I/O latency. The implementation is quite simple: add a timestamp “unsigned long start_time_usec” into struct dm_io; when all sub-ios of a dm_io complete, calculate the latency and store it into the corresponding data structure.

+++ linux-latency/drivers/md/dm.c
@@ -60,6 +61,7 @@ struct dm_io {
 	struct bio *bio;
 	unsigned long start_time;
 	spinlock_t endio_lock;
+	unsigned long start_time_usec;
 };

After running it on around 10 servers from several different cloud services, some interesting data and situations were observed, which may be helpful for us to identify the relationship between I/O latency and hard disk health.
It happened that openSUSE Conference 2013 was about to take place in Thessaloniki, Greece: a great opportunity for me to share the interesting data with friends and other developers from the openSUSE community.


Thessaloniki is a beautiful coastal city, and it was an enjoyable experience that the openSUSE conference happened here. The venue was a sports museum (a.k.a. the Olympic Museum), a very nice place for a community conference. When I entered the museum one day early, I saw many volunteers (some I knew from SUSE, and some local community members I didn’t know) busy preparing everything from meeting rooms to booths. I joined to help a little for half a day, then went back to the hotel to prepare my talk slides.




This year, I did better: the slides were accomplished 8 hours before my talk; last time in Prague it was 4~5 hours before 🙂 Many more people showed up than I expected, and during and after the talk, a lot of communication happened. Some people also suggested I update the data at next year’s openSUSE conference. This project is still at quite an early stage; I will continue to update the information next time.


This year, I didn’t meet many friends who live in Germany or the Czech Republic; maybe it was because of the long-distance travel and too-hot weather. Fortunately, one hacker I met this year helped me a lot: Oliver Neukum. We talked a lot about the seqlock implementation in the Linux kernel, and he inspired in me an idea for a non-conflicting seqlock read when reading the clock source in ktime_get(). The idea is simple: if the sequence number changed after reading the data, just discard the data and return; do not try again. In latency sampling, there is no need to measure the I/O latency of every request, and if the sampling is random (lock conflict can be treated as a kind of randomness), the statistical result is still reliable. Oliver also gave a talk on “speculative execution”, introducing the basic idea of speculative execution and its support in glibc and the kernel. This was one of the most interesting talks IMHO 🙂
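The no-retry sampling read can be sketched with C11 atomics as below; this is a userspace model with hypothetical names, not the kernel’s seqlock API:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One latency sample protected by a sequence counter; the counter is odd
 * while a writer is updating the sample. */
struct seq_sample {
	atomic_uint seq;
	unsigned long long latency_usec;
};

/* Read the sample exactly once: on any conflict with a writer, discard the
 * sample and return false instead of retrying. */
static bool read_sample_once(struct seq_sample *s, unsigned long long *out)
{
	unsigned int start = atomic_load(&s->seq);

	if (start & 1)
		return false;	/* writer in progress: skip this sample */

	unsigned long long v = s->latency_usec;

	if (atomic_load(&s->seq) != start)
		return false;	/* raced with a writer: discard, no retry */

	*out = v;
	return true;
}
```

Dropped samples just thin out the sampling randomly, so the latency statistics stay usable while readers never spin against the writer.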

During the conference, a lot of useful communication happened, e.g. I talked with Andrew Wafaa about possible ARM cooperation in China, with Max Huang about open source promotion, and with Izabel Valverde about the travel support program. This year, there was a session about openSUSE TSP (Travel Support Program) status updates. IMHO, all the updates make the program more sustainable, e.g. a more explicit travel policy, and asking sponsored people to help as volunteers in the organization. Indeed, before TSP mentioned this update, I had been doing it this way for years 🙂 Thanks to the openSUSE Travel Support Program for helping me meet community friends every year, and for the opportunity to share ideas with other hackers and community members.

Like Ralf Flaxa said, the openSUSE community has its own independent power and grows healthily. openSUSE conference 2013 is the first one to happen in a city where no SUSE office is located. I saw many people from the local community helping with venue preparation, organization and management; only a few were SUSE employees. This is impressive, and I really felt the power of the community: people just show up, take their own roles and lead. Next year, openSUSE conference 2014 will be in Dubrovnik, Croatia. I believe the community will organize another great event, and of course I will join and help in my way.


[1] slide of my talk can be found here,

[2] live video of my talk, starts from 1:17:30

October 30, 2012

openSUSE Conference 2012 in Prague

Filed under: Great Days — colyli @ 4:59 am

From Oct 20 to 23, I was invited and sponsored by openSUSE to give a talk at the openSUSE Conference (OSC2012). The venue was the Czech Technical University in Prague, Czech Republic, a beautiful university (without walls) in a beautiful city.

It had been 5 years since I last visited Prague (for a SuSE Labs conference), and 3 years since I last attended an openSUSE conference as a speaker, at OSC2009 in Nuremberg. At OSC 2009, the topic of my talk was “porting openSUSE to the MIPS platform”, a Google Summer of Code project accomplished by Eryu Guan (who became a Red Hat employee after graduation). At that time, almost all active kernel developers in China were hired by multinational companies; few local companies (not counting universities and institutes) in China contributed patches to the Linux kernel. In year 2009, after Wensong Zhang (the original author of Linux Virtual Server) joined Taobao, this local e-business company was willing to optimize the Linux kernel for its online servers and contribute patches back to the kernel community. IMHO, this was a small but important change in China, and it is my honor to have been involved in it. Therefore in June 2010, I left SuSE Labs and joined Taobao, to help the company build a kernel engineering team.

From the first day the team was built, the team and I have applied many ideas which I learned from SuSE/openSUSE kernel engineering: e.g. how to cooperate with the kernel community, how to organize kernel patches, and how to integrate kernel patches and the kernel tree with the build system. After 2+ years, with great support from Wensong and other senior managers, the Taobao kernel team has grown to 10 persons; we have contributed 160+ patches to the upstream Linux kernel, becoming one of the most active Linux kernel development teams in China. Colleagues from other departments and product lines recognize the value of Linux kernel maintenance and performance optimization, while we open all project information and kernel patches to people outside the company. With the knowledge learned from openSUSE engineering, we laid a solid foundation for Taobao’s kernel development/maintenance procedures.

This time the topic of my talk was “Linux kernel development/maintenance in Taobao — what we learn from openSUSE engineering“, an effort to say “thank you” to the openSUSE community. Thanks to the openSUSE conference organization team, I had the opportunity to introduce what we learned from openSUSE and contributed to the community in the past 2+ years. The slide file can be downloaded here, if anyone is interested in this talk.

Coming back to the openSUSE conference was a happy and sweet experience, especially meeting many old friends with whom I had worked for years. I met people from the YaST team, the server team, and SuSE Labs, as well as some who no longer work for SUSE but are still active in the openSUSE community. Thanks to the conference organization team again for giving us this rare and unique chance for face-to-face communication, especially for community members like me who are not located in Europe and have to travel overseas.

For the first 2 days, the conference venue was in the building of FIT ČVUT (Faculty of Information Technology of Czech Technical University in Prague). There were many meeting rooms available inside the building, so dozens of talks, seminars, and BOFs could run concurrently. I have to say, choosing such a large venue to accommodate 600+ registered attendees was a really great idea. On Monday the venue moved to another building; though there were fewer meeting rooms, the main room (where my talk was held) was bigger.



CPU power talk by Thomas Renninger

Cgroup usage by Petr Baudiš

Besides talking with many speakers outside the meeting rooms and chairing a BOF on Linux cgroups (control groups, especially focused on memory and I/O control), I found that some non-Linux-kernel talks attracted me quite a lot. Though all the slides and video recordings can be found on the internet (thanks to the organization team again ^_^), I would like to share the talk by Thijs de Vries, which impressed me among many excellent talks.



Thijs de Vries: Gamification – using game elements and tactics in a non-game context

Thijs de Vries is from a game design company (correct me if I am wrong). In this talk he explained many design principles and practices of his company. He mentioned that when they plan a game, there are three objects to consider, which in turn are project, procedure, and product: a project is built for the plan, a procedure is set during the project's execution, and a product is shipped as the output of the project. I do like this way of thinking about design; it is new and helpful to me. Then he introduced how to make people have fun, get involved in the game, and absorb knowledge from the game. From Thijs' talk it seems that designing fun rules and goals is not difficult, but IMHO an educational game with fun rules and social goals is hard to design even with very careful effort. His talk made me strongly feel the innovation and genius of design (and not only game design) from an angle I had never encountered or imagined before.

Besides the orthodox conference talks, a lot of conversation also happened outside the meeting rooms. Alexander Graf mentioned the effort to enable SUSE Linux on ARM boxes, a very interesting topic for people like me who are looking for low-power hardware. For some workloads at Taobao, a powerful x86 CPU no longer helps performance, and replacing it with a low-power ARM CPU may save a lot of money on power and thermal expenditure. Currently the project seems to be going well; I hope the product may ship in the near future. Jiaju Zhang also introduced his proposal for a distributed clustering protocol called Booth. We talked about the idea of Booth last year, and it was good to see the idea turn into a real project step by step. As a file system developer, I also had some discussions about btrfs and OCFS2 with SuSE Labs people. For btrfs, it was unanimous that the file system was not ready for large-scale deployment yet; people from Fujitsu, Oracle, SUSE, Red Hat, and other organizations were working hard to improve its quality for production usage. For OCFS2, we talked about file system freeze across a cluster; there has been little progress beyond an initial effort over the last 2 years, and a very incipient idea was discussed on how to freeze write I/O on each node in the cluster. It seems OCFS2 is in maintenance status currently; I hope someday I (or someone else) will have the time and interest to work on this interesting and useful feature.
This article covers just part of my experience at the openSUSE conference. OSC2012 was well organized, including but not limited to the schedule, venue, video recording, meals, travel, and hotel. Here I should thank several people who helped me attend this great conference once again:

  • The people behind the scenes who accepted my proposal
  • The people behind the scenes who kindly offered the sponsorship for my travel
  • Stella Rouzi, who helped me with my visa application
  • Andreas Jaeger, Lars Muller, and the other people who encouraged me to give a talk at OSC2012
  • Alexander Graf and others who reviewed my slides

Finally, if you are interested in more information about openSUSE conference 2012, these URLs may be informative:

Conference schedule:
Conference video:
Slide of my talk:
Video of my talk:


February 7, 2011

alloc_sem of Ext4 block group

Filed under: File System Magic — colyli @ 11:53 am

Yesterday Amir Goldstein sent me an email about a deadlock issue. I was on Chinese New Year vacation and had no time to check the code (and I knew I could not answer his question with ease anyway). Thanks to Ted, who provided a quite clear answer. Since I find Ted's answer very informative, I copy & paste the conversation to my blog. The copyrights of the quoted text below belong to their original authors.

On Sun, Feb 06, 2011 at 10:43:58AM +0200, Amir Goldstein wrote:
> When looking at alloc_sem, I realized that it is only needed to avoid
> race with adjacent group buddy initialization.
Actually, alloc_sem is used to protect all of the block group specific
data structures; the buddy bitmap counters, adjusting the buddy bitmap
itself, the largest free order in a block group, etc.  So even in the
case where block_size == page_size, alloc_sem is still needed!
– Ted

November 22, 2010

Three Practical System Workloads of Taobao

Filed under: kernel — colyli @ 9:44 pm

Days ago, I gave a talk at an academic seminar at ACT of Beihang University. In my talk I introduced three typical system workloads that we (a group of system software developers inside Taobao) observed on the most heavily used/deployed product lines. The introduction was quite brief; no details are touched on here. We don't mind sharing what we did, imperfect as it is, and we keep an open mind about cooperating with the open source community and industry to improve 🙂

If you find anything unclear or misleading, please let me know. Communication makes things better most of the time 🙂

[The slide file can be found here]
