An enjoyed kernel apprentice
Just another WordPress weblog

June 17, 2016

My DCTC2016 talk: Linux MD RAID performance improvement from 3.11 to 4.6

Filed under: Great Days,kernel — colyli @ 3:11 am

This week I was invited by Memblaze to give a talk at the Data Center Technology Conference 2016 about Linux MD RAID performance on NVMe SSDs. In the past 3 years, the Linux community has made a lot of effort to improve MD RAID performance on high-speed media, especially on RAID456. I happen to maintain the block layer for SUSE Linux, and have backported quite a lot of these patches to Linux 3.12.

In this talk, I list selected, community-recognized efforts on MD RAID5 performance improvement, and how much performance each patch (set) gained; the numbers look quite impressive. Many people contributed their talent to this work, and I am glad to say “Thank you all”!


A slide deck of this talk, in Mandarin, can be found here; currently I don’t have time to translate it into English, maybe several months later …

August 23, 2013

openSuSE Conference 2013 in Thessaloniki, Greece

Filed under: Great Days — colyli @ 6:18 am


In recent months, I have worked on hard disk I/O latency measurement for our cloud service infrastructure. The initial motivation was to identify almost-broken-but-still-working hard disks and isolate them from online services. In order to avoid modifying core kernel data structures and execution paths, I hacked the device mapper module to measure the I/O latency. The implementation is quite simple: add a timestamp “unsigned long start_time_usec” into struct dm_io; when all sub-I/Os of a dm_io complete, calculate the latency and store it into the corresponding data structure.

+++ linux-latency/drivers/md/dm.c

@@ -60,6 +61,7 @@ struct dm_io {
 	struct bio *bio;
 	unsigned long start_time;
 	spinlock_t endio_lock;
+	unsigned long start_time_usec;
 };
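The completion side of this hack is then just a subtraction of the stored timestamp from the current time. Here is a minimal user-space sketch of that arithmetic; the real hack reads the clock in the kernel's dm_io completion path, and usec_now() / io_latency_usec() are illustrative names, not kernel APIs:

```c
#include <stddef.h>
#include <sys/time.h>

/* Sketch: microsecond wall-clock timestamp, analogous to what
 * start_time_usec would store when the dm_io is created. */
static unsigned long usec_now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (unsigned long)tv.tv_sec * 1000000UL + tv.tv_usec;
}

/* Latency of one dm_io: now minus the timestamp taken at submission.
 * In the hack above, this runs when all sub-I/Os have completed. */
static unsigned long io_latency_usec(unsigned long start_time_usec)
{
	return usec_now() - start_time_usec;
}
```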

After running it on around 10 servers from several different cloud services, some interesting data and situations were observed, which may help us identify the relationship between I/O latency and hard disk health.
It happened that openSUSE Conference 2013 was about to take place in Thessaloniki, Greece, a great opportunity for me to share the interesting data with friends and other developers from the openSUSE community.


Thessaloniki is a beautiful coastal city, and it was an enjoyable experience that the openSUSE conference happened here. The venue was a sports museum (a.k.a. the Olympic Museum), a very nice place for a community conference. When I entered the museum one day early, I saw many volunteers (some I knew from SuSE and some I didn’t know, who were local community members); they were busy preparing everything from meeting rooms to the booth. I joined to help a little for half a day, then went back to the hotel to prepare my talk slides.





This year I did better: the slides were finished 8 hours before my talk; last time in Prague it was 4~5 hours before 🙂 Many more people showed up than I expected, and a lot of communication happened during and after the talk. Some people also suggested that I update the data at next year’s openSUSE conference. This project is still at quite an early stage; I will continue to present updated information next time.


This year I didn’t meet many friends who live in Germany or the Czech Republic; maybe it was because of the long-distance travel and the too-hot weather. Fortunately, one hacker I met this year helped me a lot: Oliver Neukum. We talked a lot about the seqlock implementation in the Linux kernel, and he inspired an idea for a non-lock-conflicting seqlock read when reading the clock source in ktime_get(). The idea is simple: if the sequence number changed after reading the data, just ignore the data and return; do not try again. In latency sampling there is no need to measure I/O latency for every I/O request; if the sampling is random (a lock conflict can be treated as a kind of randomness), the statistical result is still reliable. Oliver also gave a talk on “speculative execution”, introducing the basic idea of speculative execution and its support in glibc and the kernel. This was one of the most interesting talks IMHO 🙂
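The “sample or skip” read can be sketched in a few lines. This is only an illustrative user-space sketch: struct sample_slot and try_read_sample() are made-up names, and real kernel code would also need READ_ONCE()/memory barriers, which are omitted here.

```c
#include <stdbool.h>

/* A sequence-counter-protected sample, in the style of a seqlock:
 * an odd sequence number means a writer is in the middle of an update. */
struct sample_slot {
	unsigned seq;           /* even = stable, odd = writer active */
	unsigned long latency;  /* data protected by the sequence counter */
};

/* Non-retrying seqlock read: if the slot is unstable or changes under
 * us, discard the sample instead of spinning. For latency sampling,
 * dropped samples just make the sampling (acceptably) random. */
static bool try_read_sample(const struct sample_slot *s, unsigned long *out)
{
	unsigned start = s->seq;

	if (start & 1)
		return false;   /* writer in progress, skip this sample */
	*out = s->latency;
	if (s->seq != start)
		return false;   /* raced with a writer, discard the data */
	return true;
}
```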

During the conference, many useful conversations happened; e.g. I talked with Andrew Wafaa about possible ARM cooperation in China, with Max Huang about open source promotion, and with Izabel Valverde about the travel support program. This year there was a session on the openSUSE TSP (Travel Support Program) status update. IMHO, all the updates make this program more sustainable, e.g. a more explicit travel policy, and asking sponsored people to help as volunteers in the organization. Indeed, before the TSP mentioned this update, I had already done it this way for years 🙂 Thanks to the openSUSE Travel Support Program for helping me meet community friends every year, and for the opportunity to share ideas with other hackers and community members.

As Ralf Flaxa said, the openSUSE community has its own independent power and grows healthily. openSUSE Conference 2013 was the first one held in a city with no SUSE office. I saw many people from the local community help with venue preparation, organization and management; only a few were SUSE employees. This is impressive; I really felt the power of the community: people just show up, take their own roles and lead. Next year, openSUSE Conference 2014 will be in Dubrovnik, Croatia. I believe the community will continue to organize another great event, and of course I will join and help in my own way.


[1] The slides of my talk can be found here.

[2] Live video of my talk, starting from 1:17:30.

October 30, 2012

openSUSE Conference 2012 in Prague

Filed under: Great Days — colyli @ 4:59 am

From Oct 20 ~ 23, I was invited and sponsored by openSUSE to give a talk at the openSUSE Conference (OSC2012). The venue was the Czech Technical University in Prague, Czech Republic, a beautiful university (without walls) in a beautiful city.

It had been 5 years since I last visited Prague (for a SuSE Labs conference), and 3 years since I last attended an openSUSE conference as a speaker, which was OSC2009 in Nuremberg. At OSC2009, the topic of my talk was “porting openSUSE to the MIPS platform”, a Google Summer of Code project accomplished by Eryu Guan (who became a Red Hat employee after graduating). At that time, almost all active kernel developers in China were hired by multinational companies; few local companies (not including universities and institutes) in China contributed patches to the Linux kernel. In 2009, after Wensong Zhang (original author of Linux Virtual Server) joined Taobao, this local e-business company was willing to optimize the Linux kernel for its online servers and contribute patches back to the Linux kernel community. IMHO, this was a small but important change in China, and it was my honor to be involved in it. Therefore, in June 2010, I left SuSE Labs and joined Taobao to help the company build a kernel engineering team.

From the first day the team was built, the team and I have applied many ideas which I learned from SuSE/openSUSE kernel engineering, e.g. how to cooperate with the kernel community, how to organize kernel patches, and how to integrate kernel patches and the kernel tree with the build system. After 2+ years, with great support from Wensong and other senior managers, the Taobao kernel team has grown to 10 people; we have contributed 160+ patches to the upstream Linux kernel, becoming one of the most active Linux kernel development teams in China. Colleagues from other departments and product lines recognize the value of Linux kernel maintenance and performance optimization, while we open all project information and kernel patches to people outside the company. With the knowledge learned from openSUSE engineering, we have laid a solid foundation for Taobao’s kernel development/maintenance procedures.

This time the topic of my talk was “Linux kernel development/maintenance in Taobao — what we learn from openSUSE engineering“, an effort to say “Thank you” to the openSUSE community. Thanks to the openSUSE conference organization team, I had the opportunity to introduce what we learned from openSUSE and contributed back to the community over the past 2+ years. The slide file can be downloaded here, if anyone is interested in this talk.

Coming back to the openSUSE conference 2 years later was a happy and sweet experience, especially meeting many old friends with whom I worked for years. I met people from the YaST team, the server team and SuSE Labs, as well as some who no longer work for SUSE but are still active in the openSUSE community. Thanks to the conference organization team again for giving us this rare and unique chance for face-to-face communication, especially for community members like me who are not located in Europe and have to travel overseas.

The conference venue for the first 2 days was in the building of FIT ČVUT (Faculty of Information Technology of the Czech Technical University in Prague). There were many meeting rooms available inside the building, so that dozens of talks, seminars and BOFs could happen concurrently. I have to say, in order to accommodate 600+ registered attendees, choosing such a large venue was really a great idea. On Monday the venue moved to another building; though there were fewer meeting rooms, the main room (where my talk was) was bigger.



CPU power talk by Thomas Renninger

Cgroup usage by Petr Baudiš

Besides talking with many speakers outside the meeting rooms, and chairing a BOF on Linux Cgroups (control groups, especially focused on memory and I/O control), some non-Linux-kernel talks attracted me quite a lot. Though all the slides and video recordings can be found on the internet (thanks to the organization team again ^_^), I would like to share the talk by Thijs de Vries, which impressed me among many excellent talks.



Thijs de Vries: Gamification – using game elements and tactics in a non-game context

Thijs de Vries is from a game design company (correct me if I am wrong). In this talk he explained many design principles and practices used in the company. He mentioned that when they plan to design a game, there are 3 objects to consider, which in turn are project, procedure and product: a project is built for the plan, a procedure is set during the project’s execution, and a product is shipped as the output of the project. I do like this idea for design; it is something new and helpful to me. Then he introduced how to make people have fun, get involved in the game, and absorb the knowledge from the game. From Thijs’ talk, it seems designing fun rules and goals is not difficult, but IMHO an educational game with fun rules and social goals is not easy to design, even with very hard and careful effort. From his talk, I strongly felt the innovation and genius of design (indeed, not only game design) in a way I had never encountered or imagined before.

Besides the orthodox conference talks, a lot of conversation also happened outside the meeting rooms. Alexander Graf mentioned the effort to enable SUSE Linux on ARM boxes, a very interesting topic for people like me who are looking for low-power hardware. For some workloads at Taobao, powerful x86 CPUs no longer help performance; replacing them with low-power ARM CPUs may save a lot of money on power and thermal expenditure. Currently the project seems to be going well; I hope the product may ship in the near future. Jiaju Zhang also introduced his proposal for a distributed clustering protocol called Booth. We had talked about the idea of Booth last year; it was good to see this idea become a real project step by step. As a file system developer, I also had some discussions about btrfs and OCFS2 with SuSE Labs people. For btrfs, it was unanimous that this file system was not ready for large-scale deployment yet; people from Fujitsu, Oracle, SUSE, Red Hat, and other organizations were working hard to improve its quality for production usage. For OCFS2, we talked about file system freeze across the cluster; there had been little initial effort in the last 2 years, and a very incipient idea was discussed on how to freeze write I/O on each node in the cluster. It seems OCFS2 is in maintenance status currently; I hope someday I (or someone else) will have the time and interest to work on this interesting and useful feature.
This article covers just part of my experience at the openSUSE conference. OSC2012 was well organized, including but not limited to the schedule, venue, video recording, meals, travel, hotel, etc. Here I should thank several people who helped me attend this great conference once again:

  • The people behind the scenes, who accepted my proposal
  • The people behind the scenes, who kindly offered the sponsorship for my travel
  • Stella Rouzi, who helped me with my visa application
  • Andreas Jaeger, Lars Muller, and other people who encouraged me to give a talk at OSC2012
  • Alexander Graf and others who reviewed my slides

Finally, if you are interested in more information about openSUSE Conference 2012, these URLs may be informative:

Conference schedule:
Conference video:
Slide of my talk:
Video of my talk:


October 16, 2010

China Linux Storage and File System Workshop 2010

Filed under: File System Magic,Great Days,kernel — colyli @ 12:41 pm

[CLSF 2010, Oct 14~15, Intel Zizhu Campus, Shanghai, China]

Similar to the Linux Storage and File System Summit in North America, the China Linux Storage and File System Workshop is a chance for most of the active upstream I/O-related kernel developers to get together and share their ideas and current status.

We (the CLSF committee) invited around 26 people to China LSF 2010, including community developers who contribute to the Linux I/O subsystem, and engineers who develop storage products/solutions based on Linux. In order to reduce travel costs for all attendees, we decided to co-locate China LSF with CLK (China Linux Kernel Developers Conference) in Shanghai.

This year, Intel OTC (Open Source Technology Center) contributed a lot to the conference organization. They kindly provided a free and comfortable conference room and assigned employees to help with the organization and preparation; two intern students acted as volunteers helping with many trivial tasks.

CLSF 2010 was a two-day conference; here are some interesting topics (IMHO) which I’d like to share on my blog. I don’t understand every topic very well; if there is any error/mistake in this text, please let me know. Any errata are welcome 🙂

— Writeback, led by Fengguang Wu

— CFQ, Block IO Controller & Write IO Controller, led by Jianfeng Gui, Fengguang Wu

— Btrfs, led by Coly Li

— SSD & Block Layer, led by Shaohua Li

— VFS Scalability, led by Tao Ma

— Kernel Tracing, led by Zefan Li

— Kernel Testing and Benchmarking, led by Alex Shi

Besides the above topics, we also had “From Industry” sessions; engineers from Baidu, Taobao and EMC shared their experience building their own storage solutions/products based on Linux.

In this blog, I’d like to share the information I got from CLSF 2010; I hope it is informative 😉

Writeback

The first session was on writeback, which has been quite a hot topic recently. Fengguang has done quite a bit of work on it, and kindly volunteered to lead this session.

An idea was brought up to limit the dirty page ratio per process. Fengguang made a patch and shared a demo picture with us. When the dirty pages exceed the limit specified for a process, the kernel writes back this process’s dirty pages smoothly, until the number of dirty pages drops to a pre-configured rate. This idea is helpful for processes holding a large number of dirty pages. Some people were concerned that this patch didn’t help the case where a lot of processes each hold a few dirty pages. Fengguang replied that for server applications, if this condition happens, the design might be buggy.

People also mentioned that the erase block size of SSDs has increased from KBs to MBs; adopting a bigger page count when writing out may help whole file system performance. Engineers from Baidu shared their experience:

— By increasing the write-out size from 4MB to 40MB, they achieved a 20% performance improvement.

— By using an extent-based file system, they got a more contiguous on-disk layout and less memory consumption for metadata.

Fengguang also shared his idea on how to control a process’s page dirtying: the original idea was to control dirty pages by I/O (calling writeback_inode(dirtied * 3/2)); after several rounds of improvement it became wait_for_writeback(dirtied / throttle_bandwidth). By this means, the dirty page I/O bandwidth of a process is also controlled.
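The shape of that final formula is simple throttling arithmetic: instead of making the dirtier write back pages itself, it sleeps long enough that its effective dirtying rate matches the allowed writeback bandwidth. A hedged sketch of the arithmetic (pause_msec() and the units are illustrative, not the actual kernel code):

```c
/* Sketch of wait_for_writeback(dirtied / throttle_bandwidth):
 * given how many pages a task just dirtied and the bandwidth it is
 * allowed (pages per second), return how long it should sleep so
 * that dirtied / sleep_time == allowed bandwidth. */
static unsigned long pause_msec(unsigned long pages_dirtied,
				unsigned long throttle_bw_pages_per_sec)
{
	if (throttle_bw_pages_per_sec == 0)
		return 0;	/* unthrottled: nothing to wait for */
	return pages_dirtied * 1000UL / throttle_bw_pages_per_sec;
}
```

For example, a task that dirtied 100 pages under a 1000 pages/s budget would pause about 100 ms, which is exactly the "I/O bandwidth of dirty pages got controlled" effect described above.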

During the discussion, Fengguang pointed out that the event of a page getting dirty is more important than whether a page is dirty. Engineers from Baidu said that, in order to avoid a kernel/user-space memory copy during file read/write while still using the kernel page cache, they used mmap to read/write file pages rather than calling the read/write syscalls. In this case, a writable page in the mmap is initially mapped read-only; when a write happens, a page fault is triggered, and then the kernel knows the page got dirty.

It seems many ideas are being worked on to improve writeback performance, including active writeback in the background, and some cooperation with the underlying block layer. My current focus is not here; anyway, I believe the people in the room can help out a bit 🙂


Btrfs

Recently, many developers in China have started to work on btrfs, e.g. Miao Xie, Zefan Li, Shaohua Li, Zheng Yan, … Therefore we specially arranged a two-hour session for btrfs. The main purpose of the btrfs session was to share what we are each doing on btrfs.

Most people agreed that btrfs needs a real fsck tool now. Engineers from Fujitsu said they had a plan to invest people in btrfs checking tool development. Miao Xie, Zefan Li, Coly Li and other developers suggested considering the pain points of fsck from the beginning:

— memory consumption

Nowadays 10TB+ storage media are cheap and common. For a large file system built on them, fsck needs a lot of memory to hold metadata (e.g. bitmaps, dir blocks, inode blocks, btree internal blocks …). For online fsck, consuming too much memory during the check will have a negative performance impact on the page cache or other applications. For offline fsck this was not a problem, but now that online fsck is coming, we have to face this open question 🙂

— fsck speed

A tree-structured file system has (much) more metadata than a table-structured file system (like Ext2/3/4), which may mean more I/O and more time. For a 10TB+ file system that is 80% full, how to reduce the file system checking time will be a key issue, especially for online service workloads. I proposed a solution: allocate metadata on an SSD or another device with faster seeks; then checking the metadata incurs no (or little) seek time, which results in a faster file system check.

Weeks before, two intern students, Kunshan Wang and Shaoyan Wang, who worked with me, wrote a very basic patch set (including kernel and user-space code) to allocate metadata from a faster-seeking device. The patch set compiles, and the students did a quite basic verification of the metadata allocation; the patch worked. I haven’t reviewed the patch yet; from a quite rough code check, much improvement is needed. I posted this draft patch set to the China LSF mailing list to call for more comments from CLSF attendees. I hope next month I can find time to improve the great job by Kunshan and Shaoyan.

Zefan Li said there was a todo list for btrfs: a long-term task was data de-duplication, and a short-term task was allocating data from SSDs. Herbert Xu pointed out that the underlying storage media affects file system performance quite a lot; according to a benchmark by Ric Wheeler of Red Hat, on a Fusion IO high-end PCIe SSD there is almost no performance difference between well-known file systems like XFS, Ext2/3/4 or btrfs.

People also said that these days the review and merging of btrfs patches were often delayed; it seemed the btrfs maintainer was too busy to handle the community patches. There was a reply from the maintainer that the situation would improve and patches would be handled in time, but there has been no obvious improvement so far. I can understand that when a person has more urgent tasks like kernel tree maintenance, he or she has difficulty handling non-trivial patches in time if this is not his or her highest-priority job. From CLSF, I see more and more Chinese developers starting to work on btrfs; I hope they will be patient if their patches don’t get handled in time 🙂

Engineers from Intel OTC mentioned there was no btrfs support in popular boot loaders like GRUB2. IIRC, someone is working on it, and the patches are almost ready. Shaohua asked why not load the Linux kernel with a Linux kernel, like the kboot project does. People pointed out there still needs to be something to load the first Linux kernel; this is a chicken-and-egg question 🙂 My point was that it should not be very hard to enable btrfs support in the boot loader; a small Google Summer of Code project could do it. I’d like to port and merge the patches (if they are available) into openSUSE, since I maintain the openSUSE grub2 package.

Shaohua Li shared his experience with btrfs development for the MeeGo project; he did some work on fast boot and readahead on btrfs. Shaohua said some performance gains were observed on btrfs, and the better results were achieved by some hacking, like a big readahead size, a dedicated work queue to handle write requests, and a big writeback size. Fengguang Wu and Tao Ma pointed out this might be a generic hack, because Ext4 and OCFS2 also did similar hacking for better performance.

Finally, Shaohua Li pointed out there was a huge opportunity to improve the scalability of btrfs, since there were still many global locks and cache misses in the current code.

SSD & Block Layer

This was a quite interesting session led by Shaohua Li. Shaohua started the session with some observed problems between SSDs and the block layer:

— Throughput is high, like a network

— Disk controller gaps, e.g. no MSI-X …

— Big locks: queue lock, SCSI host lock, …

Shaohua shared some benchmark results showing that for high IOPS the interrupt load falls on a single CPU; even on a multi-processor system, the interrupts could not be balanced across processors, which was a bottleneck for handling the interrupts generated by SSD I/O. If a system had 4 SSDs, one processor ran at 100% handling the interrupts while throughput was only around 60%-80%.

A workaround here was polling. Replacing interrupts with blk_iopoll could help the performance numbers, as it reduces the processor overhead of interrupt handling. However, Herbert Xu pointed out the key issue was that current hardware didn’t support multiple queues for the same interrupt. Different interrupts could be balanced across all processors in the system, but unlike network hardware, the same interrupt could not be spread across multiple queues and could only be handled by a single processor. Hardware multi-queue support would be the silver bullet.

For SSDs like the ones Fusion IO produces, the IOPS could be one million+ on a single SSD device; the parallel load is much higher than on traditional hard disks. Herbert, Zefan and I agreed that some hidden race defects should be observed very soon.

Right now, the block layer is not ready for such highly parallel I/O load. Herbert Xu pointed out that lock contention might be a big issue to solve. The source of the lock contention is the cache coherence cost of global resources protected by locking. Converting global resources to per-CPU local data might be a direction for solving the lock contention issue. Since Jens and Nick can access Fusion IO devices more conveniently, we believe they can work with other developers to help out a lot.

Kernel Tracing

Zefan Li helped lead an interesting session about kernel tracing. I don’t have any real understanding of any kernel trace infrastructure; for me the only tool is printk(). IMHO printk is the best trace/debug tool for kernel programming. Anyway, debugging is always an attractive topic for curious programmers, and I felt Zefan did his job quite well 🙂

Tao Ma, an OCFS2 developer, mentioned that OCFS2 currently uses a printk wrapper for its trace code, which is not flexible and quite obsolete; OCFS2 developers were thinking of using a trace infrastructure like ftrace.

Zefan pointed out that using ftrace to replace previous printk-based trace messages should be done carefully; there might be ABI (application binary interface) issues for user-space tools. Some user-space tools work with kernel messages (one can check kernel messages with the dmesg command). An Intel engineer mentioned there was an accident recently where a kernel message modification caused the powertop tool to stop working correctly.

For file system tracing, the situation might be easier. Because most of the trace info is used by file system developers or testers, the one adding trace info into file system code might happily ignore the ABI issue. Anyway, it is just “might”, not “will be able to”.

Zefan said there was a patch introducing TRACE_EVENT_ABI; if some trace info can form a stable user-space ABI, it can be declared with TRACE_EVENT_ABI.

This session also discussed how ftrace works. Now I know the trace info is stored in a ring buffer. If ftrace is enabled but the ring buffer is not, the user is still not able to receive trace info. People also said that a user-space trace tool would be necessary.

Someone said the perf tool is getting more and more powerful these days; it is probable that the trace function will be integrated into perf. The Linux kernel only needs one trace tool, and some people in this workshop think it might be perf (I have no opinion, because I use neither).

Finally, Herbert again suggested that people pay attention to scalability issues when adding trace points. Since the ring buffer was not a per-CPU local area, adding trace points might introduce performance regressions for existing optimized code.

From Industry

At last year’s BeijingLSF, we invited two engineers from Lenovo. They shared their experience using Linux as the base system for their storage solutions. This session got quite positive feedback, and all committee members suggested continuing the From Industry sessions this year.

For ChinaLSF 2010, we invited 3 companies to share their ideas with other attendees. Engineers from Baidu, Taobao and EMC led three interesting sessions; people had the chance to learn what kinds of difficulties they encountered, how they solved the problems, and what they achieved from their solutions or workarounds. Here I share some interesting points on my blog.

From Taobao

Engineers from Taobao also shared their work based on Linux storage and file systems; the projects were Tair and TFS.

Tair is a distributed cache system used inside Taobao; TFS is a distributed user-space file system for storing images of goods on Taobao. For detailed information, please check 🙂

From EMC

Engineers from EMC shared their work on file system recovery, especially file system checking. Tao Ma and I also mentioned what we did in fsck.ocfs2 (the OCFS2 file system checking tool). The opinion from EMC was that even if an online file system check were possible, offline fsck would still be required, because an offline file system check can check and fix a file system from a higher-level scope.

Other points, including memory occupation and time consumption, had already been discussed in previous sessions.

From Baidu

This was the first time I met people from Baidu and had the chance to learn what they do on the Linux kernel. Thanks to the Baidu kernel team, we had the opportunity to know what they did in the past years.

Guangjun Xie from Baidu started the session by introducing Baidu’s I/O workload: most of the I/O is related to indexing and distributed computing, and read performance is more desired than write performance. In order to reduce memory copying when reading data, they used mmap to read data pages from the underlying media into the page cache. Pages accessed via mmap could not take advantage of the Linux kernel’s page cache replacement algorithm, and Baidu didn’t want to implement a similar page cache in user space. Therefore they used a not-beautiful-but-efficient workaround: they implemented an in-house system call which updates the pages (returned by mmap) in the kernel’s page LRU. By this means, the data pages could be managed by the kernel’s page cache code. Some people pointed out this was mmap() + read-ahead. From Baidu’s benchmark, their effort increased search workload performance by 100% on a single-node server.

Baidu also tried using a bigger block size for the Ext2 file system, to make the data block layout more contiguous; from their performance data, the bigger block size also resulted in better I/O performance. IMHO, a locally mounted ocfs2 file system may achieve similar performance, because the basic allocation unit of ocfs2 is a cluster, and the cluster size can be from 4KB to 1MB.

Baidu also tried compressing/decompressing the data when writing to/reading from disk; since most of Baidu’s data is text, the compression ratio was quite satisfyingly high. They even used a PCIe compression card, and the performance result was pretty good.

Guangjun also mentioned that when they used SATA disks, some I/O errors were silent errors. For metadata this is a fatal problem, so at least a metadata checksum is necessary. For data checksums, they did it at the application level.


Now comes the last part of this blog; let me give my own conclusion on ChinaLSF 2010 🙂

IMHO, the organization and preparation this year were much better than BeijingLSF 2009. People from Intel Shanghai OTC contributed a lot of time and effort before/during/after the workshop; without their effort, we could not have had such a successful event. Also, a big thank-you should go to our sponsor EMC China; they not only sponsored conference expenses, but also sent engineers to share their development experience.

Let’s wait for ChinaLSF 2011 next year 🙂

June 30, 2010

Taobao joins open source

Filed under: Great Days — colyli @ 10:27 am

Taobao's open source community

Today, Taobao announced its open source community —

This is a historic day: a leading Chinese internet and e-business company joins the open source world through proven practice.

The first project released there is TAIR. Tair is a distributed, high-performance key/value storage system, which has been used in Taobao’s infrastructure for some time. Taobao is on the way to making more internal projects open source. Yes, talk is cheap, show the code!

If you are working on a large-scale website with more than 10K server nodes, checking the projects there may help you avoid reinventing the wheel. Please visit and join the community to contribute. I believe people can make the community better and better. Currently, most of the active developers are Chinese speakers; that’s why you can find square characters on the website. I believe more changes will come in the future, because the people behind the community like continuous improvement 🙂

Of course, there are some other contributions to the open source community from Taobao which cannot be found there. For example, I believe patches from Taobao will appear in the Linux kernel changelog very soon 🙂

January 4, 2010

2010 first snow in Beijing

Filed under: Great Days — Tags: — colyli @ 2:03 am

Yesterday, the first snow of 2010 visited Beijing. I stayed at home till midnight, then went out to take some photos.

The air was so cold; I walked in the freezing wind for 1.5 hours. It was fun to see the snow covering everything, especially the houses, cars, and plants. Several fat cats appeared on my way without glancing at me. I guessed they were looking for some warm place to stay; I hope they felt comfortable last night and are still okay this morning. It’s probable that cats are stronger than me: after last night’s walk, even though I then stayed in a warm room, I am afraid I’ve caught a chill 🙁

In China, a big snow at the beginning of the year is considered a perfect sign for 2010. Maybe this will be another exciting and impressive new year, if we are more diligent and optimistic, who knows? 🙂

October 27, 2009

BeijingLSF 2009

Filed under: File System Magic,Great Days,kernel — colyli @ 12:31 pm

In the past several months, I was helping to organize BeijingLSF (Beijing Linux Storage and File System Workshop) with other kernel developers in China. This great event happened on Oct 24; here is my report.

Since early this year, budget control has happened in almost all companies/organizations, so many local kernel developers in China could not attend LSF in the United States (it’s quite interesting that most kernel developers in China work on storage related areas: mm, cgroup, io controller, fs, device mapper …). Under these conditions, we found there were enough people inside China to sit together for discussions on storage and file system related topics. In April 2009, a proposal was posted on for BeijingLSF. Many people provided positive feedback, and then we started to organize this event.

A seven-person committee was set up first, with people from Novell, Intel, Oracle, Red Hat, and Freescale. The committee made a 20-person invitation list. The website was built on and all invitees registered. Fortunately we got sponsorship from Novell China for soft drinks and snacks.

There were 6 sessions at BeijingLSF 2009. There were no talks; people just sat together to discuss specific topics.

The first session was on the distributed lock manager. I led the session; the discussion included:

– an introduction to the background of dlm, and the current issues in fs/dlm (most of the issues were from my closed-or-open and open-to-community BNCs).

– Oracle ocfs2 developers explained why ocfs2 uses a 64-byte lock value block.

– Jiaju Zhang (Novell) explained his patches for dlm performance improvement.

– Tao Ma (Oracle), Jiaju Zhang (Novell), Xinwei Hu (Novell) and I discussed how dlm works with ocfs2’s user mode cluster stack.

The second session was on clustering file systems, led by Tao Ma (Oracle). Tao suggested that people introduce each other before the session. During the introductions, discussion broke out as people described their current projects; by the time it finished, 40 minutes had passed. The workshop had no introduction time planned, so most of this session was used for people to get to know each other. IMHO it was worth it. This was the first time almost all storage and file system developers in China sat together, and people came to know the faces behind the email addresses.

The last session in the morning was on shared storage and snapshots, led by Xinwei Hu (Novell). Xinwei introduced how logical volume management works in a clustering environment, then discussion turned to:

– Considering that snapshots are starting to happen at the file system level, snapshots by device mapper might become less and less important in the future.

– Is it possible to support snapshots by lvm in a clustering environment, and is it worth it? There was no conclusion from the discussion, and I’d like to hear from device mapper developers 🙂

After lunch, the first session of the afternoon was on VFS readahead and writeback. The session was led by Fengguang Wu (Intel); a six-page slide deck kept people discussing for 90 minutes. Wu spent 20 minutes introducing his patch, then people talked about:

– Why anonymous pages and file pages should be treated differently.

– In order to improve writeback performance, MM should be able to throttle (maybe there is a better term) a process that is making too many dirty pages.

– On some kinds of SSD, linear read/write was slower than discrete read/write. If the storage media is an SSD, the writeback policy might make little difference.

The second session of the afternoon was on the I/O controller and I/O bandwidth, led by Jianfeng Gui (Fujitsu). This topic was quite new to most of the attendees. Jianfeng explained the concept of the io controller very well; at least I understood it was a software concept, not hardware 🙂 The io controller is an interesting idea, but most of the concern in the workshop was focused on its complexity.

The last session of the workshop was an interaction with industry. We invited an engineer from Lenovo, Yilei Lu, who was working on an internet storage solution based on the Linux operating system. Yilei introduced how they used Linux as the base system in their storage solution and what problems or difficulties they encountered. Many people offered suggestions, and most of the developers were very happy to hear feedback from users of their work.

After all six sessions there were lightning talks. Almost all attendees said this workshop was the first effort to get active upstream developers in China to sit together. Some people expressed their willingness to sponsor BeijingLSF next year (if there is one), and some said they could help organize similar events in their cities. IMHO, BeijingLSF was a great and successful event. The most important thing was not even the discussion: this was the *first* time for *ALL* (yes, ALL) of the most active storage related developers within China to see each other and have a chance to talk face to face. Unlike LSF in the United States, BeijingLSF has little direct effect on Linux storage and file system development; anyway, it was a great effort to make discussion and face-to-face communication happen.

Novell played a very important role and contributed quite a lot to BeijingLSF. I was able to use the ITO (what a great idea!) to help organize BeijingLSF, and Novell China sponsored soft drinks and snacks to make all attendees more comfortable while talking the whole day.

Finally, please permit me to thank all the attendees. They are:

Bo Yang, Coly Li, Fengguang Wu, Herbert Xu, Jeff He, Jeff Liu, Jiaju Zhang, Jianfeng Gui, Michael Fu, Tao Ma, Tiger Yang, Xiao Luo, Xinwei Hu, Xu Wang, Yang Li, Yawei Niu, Yilei Lu, Yu Zhong, Zefan Li, Zheng Yan.

Your coming and participating made BeijingLSF a great and successful event.

Beijing Linux Storage and File System Workshop 2009


[If you are interested in what the attendees look like, please check]

October 16, 2009

My first publication

Filed under: Basic Knowledge,Great Days — colyli @ 11:53 am

After publishing the Chinese translation of “Linkers and Loaders” on , this week the Chinese version of this great book was published in print. This is my first publication, though it’s a translation 🙂


If anyone finds any mistake in the translation, please send the errata to the publisher or to me directly. I do appreciate your feedback 🙂

[NOTE: the cover picture is copied from ]

September 22, 2009

The wonderful openSUSE Conference 2009

Filed under: Great Days — colyli @ 1:07 pm


From September 16-20, I was in Nuremberg, Germany for the openSUSE Conference 2009.

In previous years, Labs members attended the SuSE Labs conference. This year, the Labs conference was cancelled and we were encouraged to attend the openSUSE Conference instead. IMHO, more investment in the openSUSE community is a great idea; we need more hands from the community.

I was invited to give a talk about the open source development activities of a group of Chinese university students in Beijing. In the past 4 years, a group of students at Beijing University of Posts and Telecommunications have contributed quite a lot to the open source community. In my talk, I introduced how the people were organized and how the technical seminars were run, of course including the Google Summer of Code projects of the past 2 years. The slides and video of my talk can be found on the internet [1].

OSC09 was a great chance to meet other community members, especially some who had made excellent contributions but whom I had never met before. For example, I spend 4-5 happy hours every month reading the openSUSE Weekly News. The News content is well organized and prepared, especially the community news and People of openSUSE sections. One day during the conference, I took the tram from the SUSE office to the venue, and on the tram I met a very nice guy wearing a black openSUSE T-shirt. We talked about the openSUSE community and the Weekly News, and I was surprised to learn from his self-introduction that he was a Weekly News editor; I never thought I would have the opportunity to meet these cool guys face to face! This was a wonderful side effect of the conference, and collaboration happened: I decided to send him text whenever I had valuable news, and therefore I remembered his name — Sascha Manns.

This year I acted as a mentor for a Google Summer of Code project, guiding a student to port openSUSE to the MIPS platform (I will describe the project in detail in another blog post). During the conference, I met another Google Summer of Code group (Jan-Simon Möller, Martin Mohring, Adrian Schröter) who had ported openSUSE to the ARM platform. The openSUSE ARM porting student gave a talk at the conference to introduce their work. I got very helpful information from his talk and from the discussion after it. When we ported openSUSE to MIPS, we used system mode QEMU as the target MIPS hardware for RPM package building, which was very slow. Building GCC took around 5 days! The ARM porting team used a very smart method: user mode QEMU. User mode QEMU can run a normal program compiled for an ARM processor on an x86 machine without emulating the whole system, so compiling GCC takes just 3-4 hours. Right now QEMU does not support 64-bit user mode, so before integrating MIPS support into OBS (openSUSE Build Service), enabling 64-bit MIPS user mode support in QEMU might be our next target.

Since 2008, I have known there is a community board of openSUSE (the board definitely existed earlier). On the second day there was a session to meet the openSUSE board. Before the session, the board members were only symbols/strings/names to me. This time I learned that they were 6 people (IIRC; why not 5 or 7 seats?) and who they were. The difficulty for me was that I could remember their faces, but was not able to pronounce their names correctly. In the Q&A time, I suggested marking the pronunciation of board members’ names on the openSUSE website. This might not be a good idea, but it would be really helpful for non-native English (or other language) speakers to identify the community board members.

On the last day, there was an interesting session — ‘openSUSE Legal’. In this session, Jürgen Weigert and others explained how to cope with software patents and different software release licenses. In past years, I tried to assemble source code from MINIX (BSD-like license), the Linux kernel (GPLv2), and uClibc (LGPL) into a hobby OS. I asked a question about how to handle this situation. The answer was quite clear: 1) if I could get agreement from the code authors to use a unified license, the unified license could be used; otherwise 2) declare different licenses for different code. I need to find quite a lot of time to declare the different licenses in MLXOS [2].

Besides the above topics, I also attended some other very interesting sessions, e.g. Tackling a Buggy Kernel by Nikanth Karthikesan, Making Technology Previews Succeed by Suresh Jayaraman, openSUSE & Moblin by Michael Meeks, Visualizing Package Dependencies by Klaus Kampf, Git in the Build Service by Andreas Gruenbacher, and Samba by Lars Müller … The final lightning talks were also impressive, especially the awesome Baconn 🙂

For the openSUSE community, OSC09 was a great event: we had the chance to communicate face to face, which is very helpful for people to work closer together. Thanks to Novell for the conference sponsorship; I wish for everyone to continue enjoying the community.

[1] Slide file and video of my talk:
[2] MLXOS website:
[3] Conference schedule
[4] Pictures of OSC09

August 4, 2009

Report for the Course of Dragon Star Programming

Filed under: Basic Knowledge,File System Magic,Great Days — colyli @ 12:42 pm

From July 13-17, I took a course in the ‘Dragon Star’ program. Here is my report on this great event.

At the very beginning, please permit me to thank my Labs colleagues Nikanth Karthikesan and Suresh Jayaraman for their kind help. They reviewed the draft version of this report during their busy working hours and provided many valuable comments, making the final version’s quality much better.

Dragon Star is a series of computer science academic exchanges. It is sponsored by the China Natural Science Fund to invite outstanding overseas Chinese professors to give systematic training to postgraduate students. The Dragon Star program office is located at the Institute of Computing Technology of the Chinese Academy of Sciences.

In the past several years, most attendees of this program have been university students and researchers from state-owned institutes or organizations. Due to the limited seats for each course, the program seldom accepts applications from multinational companies (like Novell). This year, it was quite surprising that my application was approved, so I had a chance to meet people from state-owned institutes/organizations and exchange ideas with the professor and students.

The course I applied for was “Infrastructure of data-bound applications”, taught by Professor Xiaodong Zhang from Ohio State University. Frankly speaking, it was more a seminar than a course: open discussions happened freely, and the professor was willing to listen and discuss with students. Though it was a course for postgraduate students, most of the audience were PhD students, researchers and professors from universities. The venue was the Changsha Institute of Technology; Changsha is another city in China, famous for its crazy high temperatures in summer. I was surprised to learn that many students at this university stayed at school during the summer holiday and also joined the training.

In the 4.5-day course, Prof. Zhang touched on many layers of the computer storage stack, from the CPU L1/2/3 caches, to main memory, to swapping in virtual memory and the buffer cache for disk data, and finally SSDs. Though many fields were mentioned, the discussion always stayed focused on one topic: caching. As a file systems engineer, caching is something I should know and must know. At the end of the course, I felt that the content went well beyond my expectations.

Here I share some of my experiences from this course.

– Most of my computer knowledge was self-taught, and for CPU cache organization I had long held a misconception about fully associative caches and direct mapped caches. I thought a fully associative cache was much faster than a direct mapped cache because it could compare the cached values in parallel, but was too expensive, which was why the direct mapped cache was introduced. From this training, I realized a direct mapped cache is much faster than a fully associative cache. The shortcoming of a direct mapped cache is cache conflict handling, which is exactly the advantage of a fully associative cache. Even a 2-way set-associative cache can improve a lot on cache conflicts, because it can hold two conflicting values in its 2 ways. Then I learned a new (to me) idea for improving lookup speed in a set-associative cache [1].
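The conflict behavior described above is easy to see in a toy model. The sketch below (a simulation for illustration, not tied to any real hardware) compares a direct mapped cache with a 2-way set-associative cache of the same capacity on a trace that alternates between two conflicting blocks:

```python
# Toy hit/miss model: a cache is a list of sets, each holding up to `ways`
# tags in LRU order. Addresses are plain block numbers.

class SetAssocCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # MRU tag is last

    def access(self, block):
        index, tag = block % self.num_sets, block // self.num_sets
        s = self.sets[index]
        if tag in s:                 # hit: refresh LRU position
            s.remove(tag)
            s.append(tag)
            return True
        if len(s) == self.ways:      # conflict: evict least recently used
            s.pop(0)
        s.append(tag)
        return False

def hits(cache, trace):
    return sum(cache.access(b) for b in trace)

# Blocks 0 and 8 land in the same set in both configurations.
trace = [0, 8] * 10
direct = SetAssocCache(num_sets=8, ways=1)   # direct mapped
twoway = SetAssocCache(num_sets=4, ways=2)   # same capacity, 2-way

print(hits(direct, trace))  # 0: every access evicts the other block
print(hits(twoway, trace))  # 18: only the first access to each block misses
```

The direct mapped cache thrashes forever on this trace, while adding a second way absorbs the conflict completely, which is exactly the point made above.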

– In this training, I also relearned the concept of the row buffer in the memory controller. The professor told us they had observed an interesting condition: because the row buffer only caches one page-sized chunk of data, when there was a cache conflict in the L2/3 cache, there would also be a conflict in the row buffer. A conflict means the new content had to replace the old content in the same place. Then he introduced their work on how to solve this issue [2]. The solution was quite simple, but how they observed the issue and how they analyzed it made a perfect story.

– As a Linux kernel developer, two of the professor’s talks helped me understand Linux virtual memory management a little better. One was a method to improve the page replacement policy when cache memory is full, called token-ordered LRU, which minimizes the possibility of paging thrashing [3]. The other was a way to avoid caching too many accessed-once-only pages in cache memory, called Clock-Pro [7]. I had known about token-ordered LRU for a while, but this was the first time I met one of the algorithm’s authors. For Clock-Pro, I knew from Google that there were patches, not yet upstream. If you are interested in these topics, please check the references.

– Prof. Zhang’s team also did research on SSD (Solid State Disk) caching. They ran different I/O patterns and got some interesting performance numbers. One number that impressed me was that random reads and sequential reads had a noticeable performance difference on an Intel X25-E SSD. For the past 2 years, I had taken it for granted that all reads from an SSD ran at the same speed. The explanation was that there is a read-ahead buffer inside the SSD, so sequential reads are faster than random reads. Caching is everywhere!

When I stayed in Changsha, the temperature was 40+ degrees Celsius. Every day I walked 6 km between the dorm and the classroom (it is a big university) in the sunshine and high temperature; my T-shirt was wet all the time. The evenings were a good time to read related papers and review the topics Prof. Zhang had covered during the day. I have a feeling that if I had learned these topics on my own, I would have spent 3-6 months more.

The course was divided into several talks; here I list them in the order they were given. For the details of each talk, if you are interested, please check the references, where you can find the papers and slide files. The only pity is that the slides were made with MS Office and might not be perfectly compatible with OpenOffice.

Day 1
The first half of the day was an overview; the topic was “Balancing System Resource Supply and Demand for Effective Computing”.
In the second half of the day we discussed processor cache design principles, including:
– The basic logical structure of a CPU cache
– The trade-off between hit rate and access delay
– Cache design for a high hit rate with low access delay
The new word for me today was multi-column cache. Since most architecture books (that I have read) stop at the N-way set-associative cache, this was the first time I had heard of the multi-column cache.

Day 2
The first half of the day continued the previous day’s topic with cache management in multi-core systems. Prof. Zhang introduced the page coloring scheme: assign different colors to pages mapping to different cache pages, so that pages mapping to the same cache page share a color. Pages with the same color are grouped into buckets, and if a task is cache-hungry at run time, the kernel can allocate more buckets to it. Though I didn’t understand how the buckets cope with DMA (I guess for a physically mapped cache, DMA transfers might invalidate some content in the L1/2 cache), the benchmark numbers showed quite some improvement when certain applications were assigned more page buckets. Details can be found in [4]. Continuing with this idea, another work was introduced: a database prototype called MCC-DB for machines with a shared LLC (last level cache). MCC-DB analyzes the cache consumption of each query task and queues the tasks using a prediction of their cache usage. The benchmark numbers say this scheme can reduce query execution times by up to 33% [5].
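To make the coloring idea concrete, here is a minimal sketch of how a page’s color can be derived for a physically indexed cache. The cache parameters below are my own assumptions for illustration, not from the talk:

```python
# Page color sketch: frames whose page frame numbers map onto the same
# group of cache sets share a color and thus compete for the same cache
# space. All sizes below are assumed example values.

PAGE_SIZE = 4096
CACHE_SIZE = 2 * 1024 * 1024   # assumed 2 MiB last level cache
WAYS = 8                        # assumed 8-way set associative

# Number of distinct colors = per-way cache capacity / page size.
NUM_COLORS = CACHE_SIZE // WAYS // PAGE_SIZE

def page_color(phys_addr):
    # Color is the page frame number modulo the number of colors.
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

print(NUM_COLORS)                          # 64 colors with these parameters
print(page_color(0))                       # 0
print(page_color(PAGE_SIZE))               # 1
print(page_color(NUM_COLORS * PAGE_SIZE))  # wraps back to 0
```

A kernel allocator that hands a task frames of many different colors is effectively reserving more cache sets for it, which is the “more buckets” idea above.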
The second half of the day introduced the structure of the DRAM row buffer and explained why, when an L2 cache conflict happens, the row buffer will always conflict too (without the improved algorithm). Then Prof. Zhang introduced an algorithm to keep locality in DRAM and avoid row buffer conflicts. The algorithm was surprisingly simple: just XOR two parts of the physical address. The story of how they found the row buffer conflict, and how they analyzed the underlying issue, was quite interesting. Details can be found in [2].
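The XOR trick can be sketched in a few lines. This is my own illustration of the idea as I understood it, with made-up bit positions, not the exact mapping from the paper: instead of using the bank index bits of a physical address directly, XOR them with higher (cache-tag) bits, so addresses that conflict in the cache get spread across different DRAM banks:

```python
# Permutation-based bank index sketch. Bit positions are illustrative.

BANK_BITS = 4          # assumed 16 banks
BANK_SHIFT = 12        # bank index sits above the row-buffer page offset
TAG_SHIFT = 20         # assumed position of some cache-tag bits
MASK = (1 << BANK_BITS) - 1

def bank_conventional(addr):
    return (addr >> BANK_SHIFT) & MASK

def bank_permuted(addr):
    # XOR the bank index with tag bits: a tiny change in hardware.
    return ((addr >> BANK_SHIFT) ^ (addr >> TAG_SHIFT)) & MASK

# Two addresses with identical low bits but different tags: they hit the
# same bank conventionally, but the permutation separates them.
a, b = 0x00123000, 0x00453000
print(bank_conventional(a) == bank_conventional(b))  # True  (bank conflict)
print(bank_permuted(a) == bank_permuted(b))          # False (spread apart)
```

Since the XOR is its own inverse, the mapping stays a permutation: no two addresses that used to map to different banks collapse into the same one.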

Day 3
The first half of the day started with disk cache replacement algorithms. Prof. Zhang introduced LRU and analyzed its disadvantages. A major issue is that too much accessed-once-only MRU data will flush out accessed-many-times LRU data. Then he introduced an algorithm called LIRS (Low Inter-reference Recency Set). The basic idea of LIRS is a multi-level queue LRU: when a page is cached, keep it in the low level queue, and only when it is accessed more than once move it into the high level queue. The cached pages in the low level queue are the first candidates to be replaced. The whole algorithm is (IMHO) quite complex, and I cannot tell all the details without reading the paper [6] again. Though LIRS is cool, it cannot replace the current improved LRU implementation for page cache replacement in the Linux kernel. The key issue is that in LIRS, a quite complex queue insert/remove/move operation needs a big lock to protect the involved stacks, which would be a big performance penalty in kernel space. Therefore, LIRS is currently used in user space, for example in postgres (if I remember correctly).
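The multi-level queue idea can be sketched very compactly. Note this is NOT full LIRS (which also tracks inter-reference recency); it is only a heavily simplified two-queue illustration of the promotion idea described above: pages enter a small probationary queue and move to a protected queue only on a second access, so scan-once data never displaces the hot set:

```python
from collections import OrderedDict

class TwoQueueCache:
    def __init__(self, low_cap, high_cap):
        self.low = OrderedDict()    # accessed once, evicted first
        self.high = OrderedDict()   # accessed more than once
        self.low_cap, self.high_cap = low_cap, high_cap

    def access(self, page):
        if page in self.high:             # hot hit: refresh recency
            self.high.move_to_end(page)
        elif page in self.low:            # second access: promote
            del self.low[page]
            self.high[page] = True
            if len(self.high) > self.high_cap:
                self.high.popitem(last=False)
        else:                             # new page: probationary queue
            self.low[page] = True
            if len(self.low) > self.low_cap:
                self.low.popitem(last=False)

cache = TwoQueueCache(low_cap=2, high_cap=2)
for p in ("hot1", "hot2", "hot1", "hot2"):  # hot pages, each accessed twice
    cache.access(p)
for p in range(100):                        # a long scan of one-shot pages
    cache.access(p)
print(sorted(cache.high))  # ['hot1', 'hot2']: the scan did not evict them
```

With plain LRU the 100-page scan would have flushed the hot pages out, which is exactly the weakness the lecture pointed at.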
The second half of the day continued with the page cache replacement algorithm. Prof Zhang introduced an simplified algorithm called Clock-Pro [7], which was an improvement based on Clock page replacement algorithm. Right now, there is patch that implements Clock-Pro, but not upstream yet. After Clock-Pro, Prof Zhang introduced another algorithm to avoid cache thrashing which was called Token-orderd LRU[3]. Basic idea of Token-ordered LRU is, when page cache is full, new data read-in has to replace existed pages. In order to avoid thrashing, there is a token assigned to a selected process. When replacing cache pages, do not replace pages of the process who has the token. This scheme can make sure the process who has token can execute faster and finish early to release more pages. Token-ordered LRU is in upstream Linux kernel, the implementation is around 50 lines C code, perfect small.

Day 4
The first half of the day again started with disk cache replacement. For a hard disk, due to seeking and read-ahead buffering, reading contiguous data is much faster than reading random data. Therefore, caching random data in memory may carry less performance penalty than caching contiguous data when a page fault happens. However, two LBA-contiguous blocks are not necessarily adjacent in the physical layout on the hard disk. Prof. Zhang introduced how they judge whether LBA-contiguous blocks are physically adjacent on disk. The basic idea is to track the last 2 access times of each cached block: if two LBA-contiguous blocks are read in within a small interval, those 2 blocks are physically contiguous on the hard disk. Then, when the buffer cache gets full, the physically contiguous blocks are replaced first [8]. IMHO, the drawback of this algorithm is that a rather big tree structure is needed in kernel space to track the access time of each block. And since the timestamp is taken by reading the TSC, I am not sure whether there is a performance penalty in reading the TSC on every block read.
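The adjacency heuristic can be sketched as follows. The threshold value is my own guess for illustration, not a number from the paper: two LBA-consecutive blocks read within a small time window are assumed physically contiguous, and contiguous blocks are preferred as eviction victims because re-reading them is cheap:

```python
ADJACENT_WINDOW_US = 500   # assumed threshold, not from the paper

last_read_us = {}          # block LBA -> timestamp of most recent read

def record_read(lba, now_us):
    last_read_us[lba] = now_us

def physically_adjacent(lba):
    """Was this block read soon after its LBA predecessor?"""
    prev = last_read_us.get(lba - 1)
    here = last_read_us.get(lba)
    return prev is not None and here is not None \
        and abs(here - prev) <= ADJACENT_WINDOW_US

def eviction_order(cached_lbas):
    # Contiguous blocks first (cheap to fetch again), random blocks last.
    return sorted(cached_lbas, key=lambda b: not physically_adjacent(b))

record_read(100, 1000); record_read(101, 1200)   # sequential, 200 us apart
record_read(500, 5000); record_read(501, 90000)  # far apart in time
print(eviction_order([101, 501]))  # [101, 501]: evict the contiguous one first
```

The per-block timestamp table here is exactly the bookkeeping cost complained about above: real DiskSeen has to maintain this history for every cached block.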
The second half of the day was all about SSDs. Prof. Zhang spent much time introducing the basic structure of an SSD and their work in cooperation with Intel. They did quite a lot of benchmarking on an Intel SSD (X25-E, IIRC) and observed interesting performance numbers [9]. One number that impressed me was that sequential reads are faster than random reads. I asked Prof. Zhang: since a dirty block has to be erased before the next write, wouldn’t it be a natural fit to implement a copy-on-write file system on top (though the erase block size is somewhat bigger than the file system block size)? There was no explicit answer to this question.

Day 5
Today we discussed data transfer and cache management on the internet, especially P2P data sharing. The professor introduced their research explaining why P2P is the most efficient method to transfer multimedia data on the internet. The basic idea is that multimedia traffic on the internet does not follow a Zipf-like distribution [10], so a P2P network optimized for the topology of users’ locations can cache the most accessed multimedia data in the users’ local networks, maximizing data transfer bandwidth and speed for multimedia data on the internet [11].
This was the last topic of the whole course. After this talk, there was a summary speech by the Dragon Star program organizers, and the 5-day course concluded.

As the above description shows, though this course touched many fields from CPU caches to internet P2P protocols, there is a single thread running through it: improving data access performance by caching. IMHO, there are 3 methods to improve system I/O performance: caching, duplication and prefetching. This course provided a quite systematic introduction to the first of these, caching. When designing a cache system (no matter where it sits in the storage stack), some key issues should be considered: cache lookup, cache conflict, and cache replacement. This course shared the experiences of Prof. Zhang’s team on how to improve overall system performance around these issues. As a file system developer, all the content of this course hit my professional area perfectly; it brought me many new concepts and gave me the chance to learn how experienced people think and solve problems.

Prof. Zhang, please accept my sincere regards for your great work in these days. You were so kind to share your knowledge with us, and worked so hard even in the 40C+ heat. I wish we can meet somewhere, sometime, again … …
Finally, I want to thank the Dragon Star program office for organizing such a successful course, giving me the chance to meet different people and learn about their great work. I should also thank my employer, a great company, for sending me to Changsha and giving me the opportunity to have such an exciting experience.

All slide files of this course can be found from

[1] C. Zhang, X. Zhang, and Y. Yan, “Two fast and high-associativity cache schemes”, IEEE Micro, Vol. 17, No. 5, September/October, 1997, pp. 40-49.
[2] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality”, Proceedings of the 33rd Annual International Symposium on Microarchitecture, (Micro-33), Monterey, California, December 10-13, 2000. pp. 32-41.
[3] Song Jiang and Xiaodong Zhang, “Token-ordered LRU: an effective page replacement policy and its implementation in Linux systems”, Performance Evaluation, Vol. 60, Issue 1-4, 2005, pp. 5-29.
[4] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan, “Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems”, Proceedings of of the 14th International Symposium on High Performance Computer Architecture (HPCA’08), Salt Lake City, Utah, February 16-20, 2008.
[5] Rubao Lee, Xiaoning Ding, Feng Chen, Qingda Lu, and Xiaodong Zhang, “MCC-DB: minimizing cache conflicts in multi-core processors for databases”, Proceedings of 35th International Conference on Very Large Data Bases, (VLDB 2009), Lyon, France, August 24-28, 2009.
[6] Song Jiang and Xiaodong Zhang, “LIRS: an efficient low inter-reference recency set replacement to improve buffer cache performance”, Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, (SIGMETRICS’02), Marina Del Rey, California, June 15-19, 2002.
[7] Song Jiang, Feng Chen, and Xiaodong Zhang, “CLOCK-Pro: an effective improvement of the CLOCK replacement”, Proceedings of 2005 USENIX Annual Technical Conference (USENIX’05), Anaheim, CA, April 10-15, 2005.
[8] Xiaoning Ding, Song Jiang, Feng Chen, Kei Davis, and Xiaodong Zhang, “DiskSeen: exploiting disk layout and access history to enhance I/O prefetch”, Proceedings of the 2007 USENIX Annual Technical Conference, (USENIX’07), Santa Clara, California, June 17-22, 2007.
[9] Feng Chen, David Koufaty, and Xiaodong Zhang, “Understanding intrinsic characteristics and system implications of flash memory based solid state drives”, Proceedings of 2009 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, (SIGMETRICS/Performance 2009), Seattle, WA, June 15-19, 2009.
[10] Lei Guo, Enhua Tan, Songqing Chen, Zhen Xiao, and Xiaodong Zhang, “Does Internet media traffic really follow Zipf-like distribution?”, Proceedings of ACM SIGMETRICS’07 Conference, (Extended Abstract), San Diego, California, June 12-16, 2007.
[11] Lei Guo, Songqing Chen, and Xiaodong Zhang, “Design and evaluation of a scalable and reliable P2P assisted proxy for on-demand streaming media delivery”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, 2006, pp. 669-682.
