From July 13 to 17, I took a course in the ‘Dragon Star’ program. Here is my report on this great event.
At the very beginning, please permit me to thank my Labs colleagues Nikanth Karthikesan and Suresh Jayaraman for their kind help. They reviewed the draft version of this report during their busy working hours and provided many valuable comments, which made the final version much better.
Dragon Star is a series of computer science academic exchanges. It is sponsored by the China Natural Science Fund and invites outstanding overseas Chinese professors to give systematic training to postgraduate students. The Dragon Star program office is located at the Institute of Computing Technology of the Chinese Academy of Sciences.
In the past several years, most attendees of this program have been university students and researchers from state-owned institutes or organizations. Because seats for each course are limited, the program seldom accepts applications from multinational companies (like Novell). This year, quite surprisingly, my application was approved, so I had a chance to meet people from state-owned institutes and organizations and to exchange ideas with the professor and students.
The course I applied for was “Infrastructure of data-binded applications”, taught by Professor Xiaodong Zhang from Ohio State University. Frankly speaking, it was a seminar rather than a course: open discussions happened freely, and the professor was willing to listen to and discuss with students. Though it was a course for postgraduate students, most of the audience were PhD students, researchers, and professors from universities. The venue was Changsha Institute of Technology in Changsha, a city in China famous for its extremely high summer temperatures. I was surprised to learn that many students of this university stayed on campus during the summer holiday and joined the training.
In the 4.5-day course, Prof Zhang touched on many layers of the computer storage stack, from CPU L1/L2/L3 caches to main memory, swapping in virtual memory, the buffer cache for disk data, and finally SSD. Though many fields were mentioned, the discussion always focused on one topic: caching. As a file systems engineer, caching is something I should know and must know. At the end of the course, I felt that its content had gone far beyond my expectations.
Here I share some of my experiences from this course.
– Most of my computer knowledge is self-taught, and for a long time I misunderstood fully associative and direct-mapped CPU caches. I thought a fully associative cache was much faster than a direct-mapped cache because it could compare the cached tags in parallel, and that the direct-mapped cache was introduced only because a fully associative cache was too expensive. From this training, I realized that a direct-mapped cache is much faster to access than a fully associative cache. The shortcoming of a direct-mapped cache is its handling of cache conflicts, which is exactly the advantage of a fully associative cache. Even a 2-way set-associative cache reduces cache conflicts considerably, because two conflicting blocks can be kept in its 2 ways. Then I learned a new (to me) idea for improving lookup speed in a set-associative cache (see “Two fast and high-associativity cache schemes” in the references).
– In this training, I also re-learned the concept of the DRAM row buffer. The professor told us about an interesting observation. Because the row buffer caches only one page of data, a conflict in the L2/L3 cache also leads to a conflict in the row buffer. A conflict means that new content has to replace the old content in the same place. Then he introduced their work on solving this issue (see the permutation-based page interleaving paper in the references). The solution was quite simple, but how they observed the issue and how they analyzed it made a perfect story.
– As a Linux kernel developer, I found that 2 talks by the professor helped me understand Linux virtual memory management a little more easily. One was a method to improve the page replacement policy when cache memory is full, called token-ordered LRU, which reduces the likelihood of paging thrashing (see the token-ordered LRU paper in the references). The other was a way to avoid keeping too many accessed-only-once pages in cache memory, called CLOCK-Pro. I had known about token-ordered LRU for some time, but this was the first time I met one of the authors of the algorithm. For CLOCK-Pro, I learned from a web search that there are patches, but they are not upstream yet. If you are interested in these topics, please check the references.
– Prof Zhang’s team also did research on SSD (Solid State Disk) caching. They ran different I/O patterns and got some interesting performance numbers. One result that impressed me was that random reads and sequential reads showed a noticeable performance difference on the Intel X25-E SSD. For the past 2 years, I had taken it for granted that all reads from an SSD ran at the same speed. The explanation was that there is a read-ahead buffer inside the SSD, so sequential reads are faster than random reads. Caching is everywhere!
When I stayed in Changsha, the temperature was above 40 degrees Celsius. Every day, I walked 6 km between the dorm and the classroom (it is a big university) in the sunshine and heat, and my T-shirt was wet all the time. The evenings were a good time to read related papers and review the topics Prof Zhang had mentioned during the day. I had the feeling that if I had studied these topics on my own, it would have taken me 3 to 6 months longer.
The course was divided into several talks. Here I list them in the order they were given. For details of each talk, please check the references, where you can find the papers and slide files. The only pity is that the slides were made with MS Office and might not be perfectly compatible with OpenOffice.
On the first day, the first half of the day was an overview. The topic was “Balancing System Resource Supply and Demand for Effective Computing”.
In the second half of the day, we discussed processor cache design principles, including:
– Basic logical structure of CPU cache
– The trade-off between hit rate and access delay
– Cache design for high hit rate with low access delay
The new term for me that day was multi-column cache. Since most of the architecture books I have read stop at the N-way set-associative cache, this was the first time I had heard of the multi-column cache.
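As a rough illustration of the trade-off between hit rate and cache organization (not taken from the course material), here is a minimal Python sketch that replays the same address trace through a direct-mapped cache and a 2-way set-associative cache of equal capacity. The function name `count_hits`, the 64-byte block size, the 8 KiB capacity, and the trace itself are all assumptions chosen only for the example.

```python
from collections import OrderedDict

def count_hits(addresses, num_sets, ways, block_size=64):
    """Replay a trace through a set-associative cache with LRU inside each set.

    ways=1 models a direct-mapped cache. All sizes are illustrative assumptions.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_sets      # which set the block maps to
        tag = block // num_sets       # identifies the block within that set
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)    # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return hits

# Two addresses 8 KiB apart: they collide in an 8 KiB direct-mapped cache.
a, b = 0x0000, 0x2000
trace = [a, b] * 8

direct_hits = count_hits(trace, num_sets=128, ways=1)   # 128 lines x 64 B
two_way_hits = count_hits(trace, num_sets=64, ways=2)   # same capacity, 2 ways
print(direct_hits, two_way_hits)
```

In this trace the two blocks keep evicting each other in the direct-mapped cache, while the 2-way cache holds both after the first two misses. The sketch models only hit counts; it says nothing about access delay, which is the other side of the trade-off.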
On the second day, the first half of the day continued the previous day’s topic with cache management in multi-core systems. Prof Zhang introduced the page coloring scheme, which assigns different colors to pages that map to different cache pages, while pages that map to the same cache page share the same color. Pages with the same color are grouped into a bucket. If a task is cache-hungry at run time, the kernel can allocate more buckets to it. Though I didn’t understand how the buckets cope with DMA (I guess that for a physically addressed cache, a DMA transfer might invalidate some content in the L1/L2 cache), the benchmark numbers show a clear improvement when some applications are assigned more page buckets. Details can be found in the multicore cache partitioning paper in the references. Building on this idea, another piece of work was introduced: a database prototype called MCC-DB for machines with a shared LLC (last-level cache). MCC-DB analyzes the cache consumption of each query task and queues the tasks using predictions of their cache usage. According to the benchmark numbers, this scheme can reduce query execution times by up to 33% (see the MCC-DB paper in the references).
The second half of the day introduced the structure of the DRAM row buffer and explained why, when an L2 cache conflict happens, the row buffer would always conflict as well (without the improved algorithm). Then Prof Zhang introduced an algorithm that keeps locality in DRAM and avoids row buffer conflicts. The algorithm was surprisingly simple: just XOR 2 parts of a physical address. The story of how they found the row buffer conflict and how they analyzed its root cause was quite interesting. Details can be found in the permutation-based page interleaving paper in the references.
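To make the page coloring idea more concrete, here is a small Python sketch. It assumes the commonly used rule that the number of colors equals the cache size divided by (associativity times page size), and that a page’s color is its physical frame number modulo the number of colors. The 4 MiB 16-way cache, the 4 KiB page size, and the helper names are assumptions for illustration, not details from the talk or the paper.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assumed 4 KiB pages

def num_colors(cache_size, associativity, page_size=PAGE_SIZE):
    """Number of distinct page colors for a physically indexed cache."""
    return cache_size // (associativity * page_size)

def page_color(frame_number, colors):
    """Frames with the same color compete for the same region of the cache."""
    return frame_number % colors

def build_buckets(frame_numbers, colors):
    """Group physical page frames into per-color buckets."""
    buckets = defaultdict(list)
    for frame in frame_numbers:
        buckets[page_color(frame, colors)].append(frame)
    return buckets

# Hypothetical 4 MiB, 16-way shared cache.
colors = num_colors(cache_size=4 * 1024 * 1024, associativity=16)
buckets = build_buckets(range(256), colors)
print(colors, buckets[0])
```

Under these assumptions, giving a task pages from more buckets gives it a larger share of the cache, which is the knob the scheme in the text turns.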
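The XOR idea can be sketched in a few lines of Python. The bit layout below is an assumption made up for the example (8 KiB DRAM rows, 4 banks, L2 tag starting at bit 20); the real layout depends on the memory system, and the paper in the references should be consulted for the actual scheme.

```python
ROW_OFFSET_BITS = 13   # assumed: 8 KiB DRAM page (row)
BANK_BITS = 2          # assumed: 4 banks
L2_TAG_SHIFT = 20      # assumed: L2 tag starts at bit 20 (1 MiB per cache way)

BANK_MASK = (1 << BANK_BITS) - 1

def bank_conventional(addr):
    """Conventional page interleaving: bank index taken directly from the address."""
    return (addr >> ROW_OFFSET_BITS) & BANK_MASK

def bank_permuted(addr):
    """XOR the bank index with the low bits of the L2 tag."""
    tag_low = (addr >> L2_TAG_SHIFT) & BANK_MASK
    return bank_conventional(addr) ^ tag_low

# Two addresses with the same L2 index but different L2 tags (1 MiB apart).
a, b = 0x000000, 0x100000

print(bank_conventional(a), bank_conventional(b))  # same bank, different rows
print(bank_permuted(a), bank_permuted(b))          # spread over different banks
```

With these assumed parameters, the two conflicting addresses land in the same bank under conventional interleaving and in different banks after the XOR, while addresses inside one row still share a bank, so locality within a row is preserved.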
On the third day, the first half of the day started with disk cache replacement algorithms. Prof Zhang introduced LRU and analyzed its disadvantages. A major issue is that a large amount of recently used data that is accessed only once can flush out older data that is accessed many times. Then Prof Zhang introduced an algorithm called LIRS (Low Inter-reference Recency Set). The basic idea of LIRS is to turn LRU into multi-level queues. When a page is first cached, it is kept in the low-level queue; only when it is accessed more than once is it moved to the high-level queue. The cached pages in the low-level queue are the first candidates for replacement. The whole algorithm is (IMHO) quite complex, and I cannot give all the details without reading the LIRS paper (see the references) again. Though LIRS is cool, it cannot replace the current improved LRU implementation in the Linux kernel for page cache replacement. The key issue is that in LIRS the fairly complex queue insert/remove/move operations need a big lock to protect the stacks involved, which would be a big performance penalty in kernel space. Therefore, LIRS is currently used in user space, for example in postgres (if I remember correctly).
The second half of the day continued with page cache replacement algorithms. Prof Zhang introduced a simplified algorithm called CLOCK-Pro (see the CLOCK-Pro paper in the references), which is an improvement on the CLOCK page replacement algorithm. Right now there is a patch that implements CLOCK-Pro, but it is not upstream yet. After CLOCK-Pro, Prof Zhang introduced another algorithm for avoiding cache thrashing, called token-ordered LRU. The basic idea of token-ordered LRU is as follows: when the page cache is full, newly read data has to replace existing pages. To avoid thrashing, a token is assigned to a selected process. When replacing cache pages, the pages of the process that holds the token are not replaced. This scheme makes sure the process holding the token can execute faster and finish early, releasing more pages. Token-ordered LRU is in the upstream Linux kernel; the implementation is around 50 lines of C code, which is pleasantly small.
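The two-level queue idea described above can be sketched in Python as follows. This is deliberately a simplified two-queue scheme matching the description in this report, not the actual LIRS algorithm from the paper, which tracks inter-reference recency and is considerably more involved. The class name `TwoLevelLRU` and the capacity are assumptions for the example.

```python
from collections import OrderedDict

class TwoLevelLRU:
    """Simplified two-queue cache: NOT the real LIRS, only the basic idea."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.low = OrderedDict()    # pages accessed once so far
        self.high = OrderedDict()   # pages accessed more than once

    def access(self, page):
        """Touch a page; return True on a hit, False on a miss."""
        if page in self.high:
            self.high.move_to_end(page)
            return True
        if page in self.low:
            del self.low[page]       # second access: promote to the high queue
            self.high[page] = True
            return True
        if len(self.low) + len(self.high) >= self.capacity:
            # Prefer evicting from the low queue; fall back to the high queue.
            victims = self.low if self.low else self.high
            victims.popitem(last=False)
        self.low[page] = True
        return False

cache = TwoLevelLRU(capacity=4)
for page in ["A", "B", "A", "B"]:    # A and B become "hot"
    cache.access(page)
for page in range(10):               # a long scan of accessed-once pages
    cache.access(page)

hot_pages_survived = "A" in cache.high and "B" in cache.high
print(hot_pages_survived)
```

A plain LRU of the same capacity would have flushed A and B during the scan. Here the scan only churns the low queue.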
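The token idea can be illustrated with a toy victim-selection function in Python. This is only a sketch of the rule described above (skip pages owned by the token holder), not the kernel’s implementation. The function name `pick_victim`, the page names, and the process IDs are hypothetical.

```python
from collections import OrderedDict

def pick_victim(lru_pages, token_holder):
    """Choose a page to evict from an LRU-ordered mapping of page -> owner pid.

    Pages are ordered oldest first. Pages owned by the token holder are
    skipped, so that process can keep its working set and finish sooner.
    """
    for page, owner in lru_pages.items():
        if owner != token_holder:
            return page
    # Every page belongs to the token holder: fall back to plain LRU.
    return next(iter(lru_pages))

# Hypothetical cache contents, oldest first.
pages = OrderedDict([("p1", 100), ("p2", 100), ("p3", 200), ("p4", 300)])

victim_with_token = pick_victim(pages, token_holder=100)
victim_without_token = pick_victim(pages, token_holder=None)
print(victim_with_token, victim_without_token)
```

Without a token the oldest page is evicted; with process 100 holding the token, its two old pages are passed over and the oldest page of another process is chosen instead.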
On the fourth day, the first half of the day again started with disk cache replacement. On a hard disk, because of seeking and read-ahead buffering, reading contiguous data is much faster than reading random data. Therefore, when a page fault happens, re-reading contiguous data costs less than re-reading random data, so it pays to keep random data in memory and evict contiguous data first. However, 2 blocks that are contiguous in LBA are not necessarily adjacent in the physical layout on the hard disk. Prof Zhang introduced how they judge whether LBA-contiguous blocks are physically adjacent on disk. The basic idea is to track the last 2 access times of each cached block. If 2 LBA-contiguous blocks are read in within a small interval, the 2 blocks are considered physically contiguous on the hard disk. Then, when the buffer cache gets full, the physically contiguous blocks are replaced first (see the DiskSeen paper in the references). IMHO, the drawback of this algorithm is that there has to be a quite big tree structure in kernel space to track the access times of each block. Also, the timestamp is taken by reading the TSC, and I am not sure whether reading the TSC for every block read carries a performance penalty.
The second half of the day was all about SSD. Prof Zhang spent a lot of time introducing the basic structure of SSD and their work in cooperation with Intel. They ran quite a lot of benchmarks on an Intel SSD (X25-E, IIRC) and observed interesting performance numbers. One result that impressed me was that sequential reads are faster than random reads. I asked Prof Zhang whether, since a dirty block must be erased before the next write, this is a natural fit for implementing a copy-on-write file system (though the erase block size is somewhat bigger than the file system block size). There was no explicit answer to this question.
On the last day, we discussed data transfer and cache management on the internet, especially for P2P data sharing. The professor introduced their research explaining why P2P is the most efficient method for transferring multimedia data on the internet. The basic idea is that multimedia traffic on the internet does not follow a Zipf-like distribution (see the paper by Guo et al. on Zipf-like distribution in the references), so a P2P network optimized for topology (the users’ locations) can cache the most accessed multimedia data in the users’ local network (see the P2P-assisted proxy paper in the references), which maximizes bandwidth and speed for multimedia data transfer on the internet.
This was the last topic of the whole course. After this talk, there was a summary speech by the Dragon Star program organizers. Then the 5-day course concluded.
As the description above shows, though this course touched many fields from CPU caches to internet P2P protocols, there is only one thread: improving data access performance by caching. IMHO, there are 3 methods to improve system I/O performance: caching, duplication, and prefetching. This course provided a quite systematic introduction to the first one, caching. To design a cache system (no matter where it sits in the storage stack), some key issues should be considered: cache lookup, cache conflict, and cache replacement. This course shared the experience of a group of people (Prof Zhang’s team) in improving overall system performance with respect to these issues. As a file system developer, I found that the content of this course hit my professional area perfectly. It brought me many new concepts and gave me a chance to learn how experienced people think and solve problems.
Prof Zhang, please accept my sincere regards for your great work during these days. You were so kind to share your knowledge with us, and you worked so hard even in 40°C+ heat. I hope we can meet again somewhere, sometime …
Finally, I want to thank the Dragon Star program office for organizing such a successful course, which gave me the chance to meet different people and learn about their great work. I should also thank my employer, a great company, for sending me to Changsha and giving me the opportunity to have such an exciting experience.
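The timestamp-based judgment can be sketched roughly as follows in Python. This is a toy model of the idea as described in this report, not the DiskSeen implementation. The class name `AccessTracker`, the `max_gap` threshold, and the timestamp units are assumptions, and timestamps are passed in explicitly rather than read from the TSC.

```python
from collections import defaultdict, deque

class AccessTracker:
    """Remember the last 2 access times of each block (toy model)."""

    def __init__(self, max_gap):
        self.max_gap = max_gap  # assumed threshold, in arbitrary time units
        self.history = defaultdict(lambda: deque(maxlen=2))

    def record(self, lba, timestamp):
        self.history[lba].append(timestamp)

    def looks_contiguous(self, lba):
        """Guess whether blocks lba and lba + 1 are physically adjacent.

        Only the most recent timestamp of each block is compared here;
        a fuller version could also use the older one.
        """
        first = self.history.get(lba)
        second = self.history.get(lba + 1)
        if not first or not second:
            return False
        return abs(second[-1] - first[-1]) <= self.max_gap

tracker = AccessTracker(max_gap=50)
tracker.record(100, 1000)
tracker.record(101, 1010)    # read in right after block 100
tracker.record(500, 1000)
tracker.record(501, 90000)   # read in much later than block 500

print(tracker.looks_contiguous(100), tracker.looks_contiguous(500))
```

Blocks judged contiguous would then be the preferred eviction candidates when the buffer cache fills up, since re-reading them is cheap.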
All slide files of this course can be found at http://blog.coly.li/docs/dragonstar/dragonstar.tar.bz2
 C. Zhang, X. Zhang, and Y. Yan, “Two fast and high-associativity cache schemes”, IEEE Micro, Vol. 17, No. 5, September/October, 1997, pp. 40-49.
 Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality”, Proceedings of the 33rd Annual International Symposium on Microarchitecture, (Micro-33), Monterey, California, December 10-13, 2000. pp. 32-41.
 Song Jiang and Xiaodong Zhang, “Token-ordered LRU: an effective page replacement policy and its implementation in Linux systems”, Performance Evaluation, Vol. 60, Issue 1-4, 2005, pp. 5-29.
 Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan, “Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems”, Proceedings of the 14th International Symposium on High Performance Computer Architecture (HPCA’08), Salt Lake City, Utah, February 16-20, 2008.
 Rubao Lee, Xiaoning Ding, Feng Chen, Qingda Lu, and Xiaodong Zhang, “MCC-DB: minimizing cache conflicts in multi-core processors for databases”, Proceedings of 35th International Conference on Very Large Data Bases, (VLDB 2009), Lyon, France, August 24-28, 2009.
 Song Jiang and Xiaodong Zhang, “LIRS: an efficient low inter-reference recency set replacement to improve buffer cache performance”, Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, (SIGMETRICS’02), Marina Del Rey, California, June 15-19, 2002.
 Song Jiang, Feng Chen, and Xiaodong Zhang, “CLOCK-Pro: an effective improvement of the CLOCK replacement”, Proceedings of 2005 USENIX Annual Technical Conference (USENIX’05), Anaheim, CA, April 10-15, 2005.
 Xiaoning Ding, Song Jiang, Feng Chen, Kei Davis, and Xiaodong Zhang, “DiskSeen: exploiting disk layout and access history to enhance I/O prefetch”, Proceedings of the 2007 USENIX Annual Technical Conference, (USENIX’07), Santa Clara, California, June 17-22, 2007.
 Feng Chen, David Koufaty, and Xiaodong Zhang, “Understanding intrinsic characteristics and system implications of flash memory based solid state drives”, Proceedings of 2009 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, (SIGMETRICS/Performance 2009), Seattle, WA, June 15-19, 2009.
 Lei Guo, Enhua Tan, Songqing Chen, Zhen Xiao, and Xiaodong Zhang, “Does Internet media traffic really follow Zipf-like distribution?”, Proceedings of ACM SIGMETRICS’07 Conference, (Extended Abstract), San Diego, California, June 12-16, 2007.
 Lei Guo, Songqing Chen, and Xiaodong Zhang, “Design and evaluation of a scalable and reliable P2P assisted proxy for on-demand streaming media delivery”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, 2006, pp. 669-682.