An enjoyed kernel apprentice Just another WordPress weblog

September 22, 2009

The wonderful openSUSE Conference 2009

Filed under: Great Days — colyli @ 1:07 pm


In September 16-20, I was in Nuremberg Germany for openSUSE Conference 2009.

In previous years, Labs member attended SuSE Labs conference. This year, the Labs conference was cancelled and we were encouraged to attend openSUSE Conference. IMHO, more investment on openSUSE community is a great idea, we need more hands from community.

I was invited to give a talk, the topic was about the open source development activities among a group of Chinese university students in Beijing. In the past 4 years, a group of university students in Beijing University of Post and Telecommunication contribute quite a few to open source community. In my talk, I introduced how the people were grouped and how the technical seminars were organized, of cause including the Google Summer of Code projects in the past 2 years. The slide file and video of my talk can be found on the internet [1].

OSC09 was a great chance to meet other community members, especially some ones did excellent contributions but never met before. For example, I spend 4~5 happy hours every month to read the openSUSE Weekly News. The News content is well organized and prepared, especially the community news and people of openSUSE. During the conference. One day I took Tram to conference venue from SUSE office, on the Tram I met a very nice guy, wore a black openSUSE T-shirt. We talked about the openSUSE community and the Weekly News, I was supprised to know him as a Weekly News editor after his self introduction, I never though I had opportunity to meet these cool guys face to face! This was the wonderful side effect of this conference, and collaboration happened. I decided to send text to him when I had valuable News, therefore I remembered his name — Sascha Manns.

This year I acted as mentor of a Google Summer of Code project, to guide a student to port openSUSE to MIPS platform (I will mention the project detail in another blog). During the conference, I met another Google Summer of Code group (Jan-Simon Möller, Martin Mohring, Adrian Schröter) who ported openSUSE to ARM platform. The openSUSE ARM porting student gave a talk on the conference, to introduce their job. I got very helpful information from his talk, and from the discussion after the talk. When we ported openSUSE to MIPS, we used system mode QEMU as the target MIPS hardware for RPM package building, which was every slow. Building GCC even spent around 5 days! The ARM porting team used a very smart method, they used user mode QEMU. The user mode QEMU is able to run a normal program on x86 machine which compiled for ARM processor, without emulating the whole system, compiling GCC just spends 3~4 hours. Right now QEMU does not support 64bit user mode, before integrating MIPS support to OBS (OpenSUSE Build Service), enabling the 64bit MIPS user mode support for QEMU might be our next target.

Since 2008, I know there is a community board of openSUSE (the board definitely exists earlier). On the second day here was a session to meet openSUSE board. Before the session, the board members were only symbols/strings/names to me. This time I knew they were 6 people (IIRC, why not 5 or 7 seats?) and who they were. The difficulty for me was, I could remember their faces, but was not able to pronounce their names correctly. In the Q&A time, I suggested to mark the pronunciation of board members on openSUSE website. This might not be a good idea, but really helpful for non native English (or other language) speakers to identify the community board members.

On the last day, there was an interesting session — ‘openSUSE Legal’. In this session, Jürgen Weigert and other people explained how to comp with software patent, different software release license. In the past years, I tried to assemble source code from MINIX (BSD like license), Linux Kernel (GPL2)), uClibc (GLGPL) into a hobby OS. I asked a question on how to do with this condition. The answer was quite clear, 1) If I could get announcement from the code authors to use a unified license, the unified license could be used. Otherwise 2) declare different license for different code. I need to find quite a lot time to declare different license to MLXOS [2].

Besides the above topics, I also attended some other very interesting sessions, e.g. Tackling a Buggy Kernel by Nikanth Karthikesan, Making Technology Previews Succeed by Suresh Jayaraman, openSUSE & Moblin by Michael Meeks, Visualizing Package Dependencies by Klaus Kampf, Git in the Build Service by Andreas Gruenbacher, Samba by Lars Müller … The final light talk was also impressive to me, especially the awesome Baconn 🙂

For openSUSE community, the OSC09 is a great event, we have chance to do face to face communication, which is very helpful to people work closer. Thanks to Novell for the conference sponsorship, wish all people continue to enjoy the community.

[1] Slide file and video of my talk:
[2] MLXOS website:
[3] Conference schedule
[4] Pictures of OSC09

September 10, 2009

add extra mount option when mount ocfs2 volume by RA

Filed under: File System Magic — colyli @ 10:16 am

Just today, I know how to add extra mount options (like journal mod, or acl, or ….) when mount ocfs2 volume by pacemaker resource agent.

Type ‘crm ra info Filesystem ocf’, here is the output:

Filesystem resource agent (ocf:heartbeat:Filesystem)

Resource script for Filesystem. It manages a Filesystem on a shared storage medium.

Parameters (* denotes required, [] the default):

device* (string): block device
The name of block device for the filesystem, or -U, -L options for mount, or NFS mount specification.

directory* (string): mount point
The mount point for the filesystem.

fstype* (string): filesystem type
The optional type of filesystem to be mounted.

options (string)
Any extra options to be given as -o options to mount.

For bind mounts, add “bind” here and set fstype to “none”.
We will do the right thing for options such as “bind,ro”.

Operations’ defaults (advisory minimum):

start         timeout=60
stop          timeout=60
notify        timeout=60
monitor_0     interval=20 timeout=40 start-delay=0

Now I know add a options=”” key words in the cib configuration can make it. Thanks to Dejan Muhamedagic 🙂

August 4, 2009

Report for the Course of Dragon Star Programming

Filed under: Basic Knowledge,File System Magic,Great Days — colyli @ 12:42 pm

In July 13~17, I took a course called ‘Dragon Star’ programming. Here is my meeting report of this great event.

At the very beginning, please permit me to thank my Labs colleagues Nikanth Karthikesan and Suresh Jayaraman for their kindly help. They review the draft version of this report during their busy working hours, provide many valuable comments and make the final version’s quality much better.

Dragon Star is a series of computer science academic communications. It’s sponsored by China Natural Science Fund, to invite outstanding overseas Chinese professor to give systematic training for post graduate students. Dragon Star programming office is located at Institute of Computing Technology of Chinese Academy of Science.

In the past several years, most attendees of this program are university students, researcher from state owned institute or organization. Due to the limited seats for each course, this program seldom accepts applications from multi-national companies (like Novell). This year, it’s quite surprising that my application was approved, therefore I have a chance to know people from state owned institute/organization and exchange ideas with professor and students.

The course I applied was “Infrastructure of data-binded applications”, which was taught by Professor Xiaodong Zhang from Ohio State University. Frankly speaking, it was a seminar rather than a course, open discussions happened freely, professor was willing to hear and discuss with students. Though it was a course for post graduate students, most of the audience were PhD student, researcher and professor from universities. The venue was in Changsha Institute of Technology, where Changsha is another city in China, famous for its crazy high temperature in summer. I was surprised to know that many students in this university stayed in school during the summer holiday and also joined the training.

In the 4.5 days course, Prof Zhang touched quite a lot layers of the whole computer storage stack, from CPU L1/2/3 caches, to main memory, swapping in virtual memory and buffer cache for disk data, and finally SSD. Though many fields were mentioned, the discussion was always focused on one topic — caching. As an file systems engineer, caching is something I should know and I must know. At the end of the course, I did feel that the content of this course was much more beyond my expectation.

Here I share some of my experiences from this course.

– Most of my computer knowledge was learned by myself, for CPU caches organization, I had a misunderstood concept of fully associative cache and direct mapped cache for long. I thought, fully associative cache was much faster than direct mapped cache, because it could compare the cached value in parallel. But full associated cache was too expensive, therefore direct mapped cache was introduced. From this training, I realized direct mapped cache is much faster than full associated cache. The short coming of direct mapped cache was cache conflict handling, which is just the advantage of fully associative cache. Even a 2 way set-assocated cache can improve a lot on cache conflict because it can cache different values in these 2 ways. Then I learned a new (to me) idea to improve cache look-up speed in set-associative cache [1].

– In this training, I also re-learned the conception of row buffer in memory controller. The professor told us they observed an interesting condition. Because the row buffer only cache one page size data, when there was a cache conflict in L2/3 cache, there should be a buffer conflict in row buffer. The conflict means, new content had to replace the old content in same place. Then he introduced their work on how to solve this issue [2]. The solution was quite simple, but how they observed this issue, and how they made the analysis, this was a perfect story.

– As a Linux kernel developer, 2 talks by this professor helped me to understand the Linux virtual memory management a little bit easier. One was a method to improve page replacement policy when cache memory is full, called token-ordered LRU, which could minimize the possibility to paging thrashing [3]. Another was to avoid caching too many access-once-only pages in cache memory, which was called clock-pro[4]. I knew the token-ordered LRU for days, but this was the first time to meet one of the algorithm authors. For clock-pro, from google I knew there was patches and not upstream yet. If you are interested on these topics, please check the reference.

– Prof Zhang’s team also did research on SSD (Solid State Disk) caching. They ran different I/O pattern and got some interesting performance numbers. One of the performance numbers impressed me was, random reads and sequential reads had recoganized performance difference on Intel X25-E SSD. In the past 2 years, I took it for granted that reading from SSD were all in same speed. The explanation here was that, there was read-ahead buffer inside the SSD, therefore sequential reads is faster than random reads. Caching is everywhere!

When I stayed in Changsha, the temperature was 40+ degree Celsius. Everyday, I walked 6km between dorm and classroom (a big university) in the sunshine and high temperature, my T-shirt was wet all the time. In the evening, it was good time to read related papers and review the topics which Prof Zhang mentioned in day time. I had a feeling that if I learned these topics myself, I would have spent 3-6 months more.

The course was divided into several talks, here I list all the talks in the order of when they were taken. For details of each talk, if you are interested, please check the reference, you can find papers and slide files there. The only pitty is, that the slide is made by MS office, might not be perfectly compatible with OpenOffice.

Day 1
The top half day was overview, topic was “Balancing System Resource Supply and Demand for Effective Computing”.
The bottom half days we discussed processor cache design principles, included,
– Basic logical structure of CPU cache
– The trade-off between hit rate and access delay
– Cache design for high hit rate with low access delay
Today, the new word for me was multi-column cache. Since most of the architecture book (I read) stops at N-way set-assocaited cache, this was the first time I knew multi-column cache.

Day 2
The first half of the day started with last day’s topic, cache management in multi-core system. Prof Zhang introduced the page coloring scheme, to assign different color to different page mapping to different cache page, and pages mapping to same cache page have same color. Pages with same color are grouped into bucket. If a task is cache consumed in run time, the kernel can allocate more buckets for it. Though I didn’t understand how the buckets can cope with DMA (I guess for physical address mapped cache, DMA transfer might invalidate some content in L1/2 cache), the benchmark number shows quite some improvement when some applications are assigned with more page buckets. Details can be found from [4]. Continuing on this idea, another work was introduced, it was a database prototype called MCC-DB on LCC (last level cache) shared machine. MCC-DB intends to analyze the cache consumption of each query task, and queue the taks with knowledge of cache usage prediction. The benchmark number says this scheme can reduce query execution times by up to 33% [5].
The second half of the day introduced the structure of DRAM row buffer, and explained why when L2 cache conflict happens, row buffer would always conflict (without the improved algorithm). Then Prof Zhang introduced an algorithm to keep the locality in DRAM and avoid row buffer conflict. The algorithm was surprisingly simple, just XOR 2 parts of a physical address. The story how they found the row buffer conflict, and how to analyze the source issue, was quite interesting. Details can be found from [2].

Day 3
The first half of the day started with disk cache replacement algorithm. Introduced LRU, and analyzed the disadvantage of LRU. A major issue was, too many accessed-once-only MRU data will flush out accessed-many-times LRU data. Then Prof Zhang introduced an algorithm called LIRS( Low Inter-reference Recency Set). The basic idea of LIRS is to make multi-level queues LRU. When a page is cached, just keep it in low level queue, only when it is accessed more than once, move it into high level queue. The cache pages in low level queue are the first choice to be replaced out. The whole algorithm is (IMHO) quite complex, and I can not tell all the details without reading the paper [6] again. Though LIRS is cool, it can not replace current improved LRU implementation in Linux kernel for page cache replacement. The key issue is, in LIRS, a quite complex queue insert/remove/move operation needs a big lock to protect involved stacks, which will be a big performance penalty in kernel space. Therefore, currently LIRS is used in user space, for example postgres (if I remember correctly).
The second half of the day continued with the page cache replacement algorithm. Prof Zhang introduced an simplified algorithm called Clock-Pro [7], which was an improvement based on Clock page replacement algorithm. Right now, there is patch that implements Clock-Pro, but not upstream yet. After Clock-Pro, Prof Zhang introduced another algorithm to avoid cache thrashing which was called Token-orderd LRU[3]. Basic idea of Token-ordered LRU is, when page cache is full, new data read-in has to replace existed pages. In order to avoid thrashing, there is a token assigned to a selected process. When replacing cache pages, do not replace pages of the process who has the token. This scheme can make sure the process who has token can execute faster and finish early to release more pages. Token-ordered LRU is in upstream Linux kernel, the implementation is around 50 lines C code, perfect small.

Day 4
The first half of the day again started with disk cache replacement. For harddisk, due to seeking and read ahead buffering, reading continuous data is much faster than reading random data. Therefore, caching random data in memory might have less performance penalty than caching continuous data when page fault happens. Because 2 LBA continous blocks does not mean they are adjacent in physical layout on hard disk. Prof Zhang introduced how they make judgement on whether LBA continous blocks were physically adjacent on disk. The basic idea is, tracking last 2 access time of each cached block. If two LBA continous blocks are read-in within a small interval, these 2 blocks are physically continous on harddisk. Then when buffer cache gets full, replace the physically continous blocks first [8]. IMHO, the drawback of this algorithm is, there has to be a quite big tree structure inside kernel space to trace access time of each block. And the timestamp is made by reading TSC, I am not sure whether there is performance penalty to read TSC on per-block-reading.
The second half of the day was all for SSD. Prof Zhang spent much time to introduce the basic structure of SSD, and their work in cooperation with Intel. They did quite a lot benchmark on Intel SSD (X25-E IIRC), and observed interesting performance numbers[9]. One of the numbers impressed me was, continuous reads is faster than random reads. I asked Prof Zhang, a dirty block should be erased before next writing, was it a nature feature to implement a copy-on-write file system (though the erase block size is some how bigger than file system block size). There was no explicit answer of this question.

Day 5
Today we discussed data transfer and cache managerment on internet, especially on P2P data sharing. The Professor introduced their research to explain why P2P is the most efficient method to transfer multimedia data on internet. The basic idea is multimedia information transfer on internet is not a zipf-like distribution[10], a topology (of users’ location) optimized P2P network can cache the most accessed multimedia data in the users’ local network , which can maximize data transfer bandwidth and speed for multimedia data on internet[11].
This was the last topic of the whole course. After this talk, there was a summary speech by Dragon Star program organizers. Then the 5 days course concluded.

From the above description, though this course touched many fields from CPU cache to internet P2P protocol, there is only one thread: improve data access performance by caching. IMHO, there are 3 methods to improve system I/O performance, cache, duplicated and prefetch. This course provides a quite systematic introduction on the first solution — caching. To design a cache system (no matter where it is located in the storage stack), some key issues should be considered: cache lookup, cache conflict, and cache replacement. This course shares the experiences of a group of people (Prof Zhang’s team) on how to improve overall system performance against these issues. As a file system developer, all the content of this course perfectly hit my professional area, which brings me many new conceptions and provides me a chance to learn how experienced people think and solve problem.

Prof Zhang, please accept my sincerely regards, for your great job in these days. You were so kind to share your knowledge with us, worked so hard even with 40C+ high temperature. Wish we can meet somewhere sometime again … …
Finally I want to thank the dragon star program office to organize such a successful course, so I have chance to know different people and their great jobs. Also I should thank my employer, a great company, to send me to Changsha, give me an opportunity to have such an excited experience.

All slide files of this course can be found from

[1] C. Zhang, X. Zhang, and Y. Yan, “Two fast and high-associativity cache schemes”, IEEE Micro, Vol. 17, No. 5, September/October, 1997, pp. 40-49.
[2] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality”, Proceedings of the 33rd Annual International Symposium on Microarchitecture, (Micro-33), Monterey, California, December 10-13, 2000. pp. 32-41.
[3] Song Jiang and Xiaodong Zhang, “Token-ordered LRU: an effective page replacement policy and its implementation in Linux systems”, Performance Evaluation, Vol. 60, Issue 1-4, 2005, pp. 5-29.
[4] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan, “Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems”, Proceedings of of the 14th International Symposium on High Performance Computer Architecture (HPCA’08), Salt Lake City, Utah, February 16-20, 2008.
[5] Rubao Lee, Xiaoning Ding, Feng Chen, Qingda Lu, and  Xiaodong Zhang , “MCC-DB: minimizing cache conflicts in muli-core processors for databases”, Proceedings of 35th International Conference on Very Large Data Bases, (VLDB 2009), Lyon, France, August 24-28, 2009.
[6] Song Jiang and Xiaodong Zhang, LIRS: an efficient low inter-reference recency set replacement to improve buffer cache performance, Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, (SIIMETRICS’02), Marina Del Rey, California, June 15-19, 2002.
[7] Song Jiang, Feng Chen, and Xiaodong Zhang, “CLOCK-Pro: an effective improvement of the CLOCK replacement”, Proceedings of 2005 USENIX Annual Technical Conference (USENIX’05), Anaheim, CA, April 10-15, 2005.
[8] Xiaoning Ding, Song Jiang, Feng Chen, Kei Davis, and Xiaodong Zhang, “DiskSeen: exploiting disk layout and access history to enhance I/O prefetch”, Proceedings of the 2007 USENIX Annual Technical Conference, (USENIX’07), Santa Clara, California, June 17-22, 2007.
[9] Feng Chen, David Koufaty, and Xiaodong Zhang, “Understanding intrinsic characteristics and system implications of flash memory based solid state drives”, Proceedings of 2009 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems}, (SIGMETRICS/Performance 2009), Seattle, WA, June 15-19, 2009.
[10] Lei Guo, Enhua Tan, Songqing Chen, Zhen Xiao, and Xiaodong Zhang, “Does Internet media traffic really follow Zipf-like distribution?”, Proceedings of ACM SIGMETRICS’07 Conference, (Extended Abstract), San Diego, California, June 12-16, 2007.
[11] Lei Guo, Songqing Chen, and Xiaodong Zhang, “Design and evaluation of a scalable and reliable P2P assisted proxy for on-demand streaming media delivery”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, 2006, pp. 669-682.

July 3, 2009

gcc macro and preprocessor

Filed under: Basic Knowledge — Tags: — colyli @ 3:21 am

These days, I am checking sys_mount() implementation. During the code reading, I find I can not tell most of the details of macro stringification and concatenation now.

After spending 1.5 days on google and gcc online manual, I re-read macro and preprocessor related part, especially on the stringification and concatenation. After the reading, I write an document for most of the outlines. The document can be found from here [click the URL to download].

Wish it can be helpful to others, of cause I will re-read this doc again and again in future 🙂

July 1, 2009

2 years in SuSE Labs

Filed under: Great Days — Tags: — colyli @ 12:46 pm

July 1st 2007, I joined SuSE Labs kernel team. Today it’s 2 years already, time is flying … …

I could do better in the past 2 years … …

Wish I can contribute more in the next year 🙂

June 29, 2009

when failed to apply openSuSE patches

Filed under: kernel — Tags: — colyli @ 2:02 am

openSuSE kernel is an upstream kernel tree with a set of patches. Last week, the second time I encountered the patch applying failure. From the .rej file, I could not see any conflict. This problem blocked me for the whole weekend, until my colleague Jeff Mohoeny and Jiri Slaby showed me the magic.

Jiri told me there were hardlinks from the git repository to the patched directory. If some files from the git repository got modified, the one in patched directory also got modified. Unless the modification got patched or edited by quilt edit, it should be problematic. I removed all the decompressed upstream kernel tree and repatched everything from the tar.bz2 upstream code file, everything backed to work.

Jeff kindly replied me a hint to solve this issue when using vim (cause it will try to edit in place), which is adding something into .vimrc:

set backup

if  version > 700

set    backupcopy=auto,breakhardlink


set    backupcopy=no


Jeff also told me the “if” part was because the SLES9 vim does not support the breakhardlink option, but it’ll still do the right thing.

By this experience, I realized it’s time to setup a blog for myself. Here it is 🙂

June 10, 2009

Hello world!

Filed under: Uncategorized — colyli @ 9:28 pm

Yeah, finally I start my blog. It’s not too bad to use “Hello world” as the first blog title 🙂

« Newer Posts

Powered by WordPress