An enjoyed kernel apprentice Just another WordPress weblog

October 13, 2016

Why 4KB I/O requests are not merged on DM target

Filed under: File System Magic,kernel — colyli @ 10:00 am

(This article is for SLE11-SP3, which is based on Linux 3.0 kernel.)

Recently people reported that on SLE11-SP3 they observe that I/O requests are not merged on a device mapper target, and 'iostat' displays an average request size of only 4KB.

This is not a bug, and there is no negative performance impact. Here I try to explain why this situation is not a bug and how it happens. A few Linux file system and block layer internals will be mentioned, but they are not complicated to understand.

The story is: on a SLE11-SP3 machine, an LVM volume is created (as a linear device mapper target), and 'dd' is used to generate sequential WRITE I/Os onto this volume. People tried buffered I/O and direct I/O with the 'dd' command, on the raw device mapper target and on an ext3 file system on top of the device mapper target. So there are 4 cases,

1) buffered I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M
2) direct I/O on raw device mapper target,
dd if=/dev/zero of=/dev/dm-0 bs=1M oflag=direct
3) buffered I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M
4) direct I/O on ext3 file system (on top of the device mapper target),
dd if=/dev/zero of=/mnt/img bs=1M oflag=direct

For 2) and 4), large request sizes are observed, from hundreds to thousands of sectors, with a maximum request size of 2048 sectors (because bs=1M). But for 1) and 3), all the request sizes displayed by 'iostat' on the device mapper target dm-0 are 8 sectors (4KB).

The question is: sequential write I/Os are supposed to be merged into larger ones, so why is the request size reported by 'iostat' on the device mapper target /dev/dm-0 only 4KB, not merged into a larger request size?

At first, let me give the simple answer: a) this device mapper target does not merge small bios into large ones, and b) the upper layer code only issues 4KB bios to the device mapper layer.

Let me explain these 2 points in detail. For direct I/O, the request size on the device mapper target is the actual size sent from the upper layer; it might come directly from the application buffer, or be adjusted by the file system. So we only look at the buffered I/O cases.

a) device mapper target does not merge bios
Device mapper only handles bios. In the case of a linear device mapper target (a common and simple lvm volume), it only re-maps the original bio from the logical device mapper target to the actual underlying storage device, or maybe splits the bio into smaller ones if the original bio crosses multiple underlying storage devices. It never combines small bios into larger ones; it just re-maps the bios and submits them to the underlying block layer. The elevator, a.k.a. the I/O scheduler, handles request merging and scheduling; device mapper does not.
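
To make this concrete, here is a minimal sketch of what a linear target's map function conceptually does (loosely modeled on drivers/md/dm-linear.c of that era; the context structure, names and the simplified signature are my own, not the actual code):

#include <linux/bio.h>
#include <linux/device-mapper.h>

/* hypothetical per-target context: the underlying device and the start
 * sector of the mapped range on that device */
struct my_linear_ctx {
        struct dm_dev *dev;
        sector_t start;
};

/* re-map one bio: point it at the real device and shift its sector;
 * note that nothing here ever merges two bios into one */
static int my_linear_map(struct dm_target *ti, struct bio *bio)
{
        struct my_linear_ctx *lc = ti->private;

        bio->bi_bdev = lc->dev->bdev;
        bio->bi_sector = lc->start + (bio->bi_sector - ti->begin);

        /* hand the re-mapped bio back; the block layer and its elevator
         * take over from here */
        return DM_MAPIO_REMAPPED;
}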

b) upper layer code issues only 4KB bios
For buffered I/O, the file system only dirties the pages containing the data to be written; the actual write is handled by the writeback and journal code automatically,
– journal: ext3 uses jbd to handle journaling. In data=ordered mode, jbd only handles metadata blocks and submits the metadata I/O via buffer heads, which means the maximum size is one page (4KB).
– write back: the kernel code that submits I/O to disk in the writeback path is mm/page-writeback.c:do_writepages(). In SLE11-SP3 it looks like this,

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        int ret;

        if (wbc->nr_to_write <= 0)
                return 0;
        if (mapping->a_ops->writepages)
                ret = mapping->a_ops->writepages(mapping, wbc);
        else
                ret = generic_writepages(mapping, wbc);
        return ret;
}

The raw device mapper target is accessed through the block device's address space (its device node lives on devtmpfs), which does not have a writepages() method defined. Ext3 does not define a writepages() method in its a_ops either, so both cases go into generic_writepages().

Inside generic_writepages(), the I/O code path is: generic_writepages()==>write_cache_pages()==>__writepage()==>mapping->a_ops->writepage(). The implementation of mapping->a_ops->writepage() differs between the two cases.
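
Note that write_cache_pages() hands pages to the lower layer one page at a time. From memory, the __writepage() helper in mm/page-writeback.c of that kernel is roughly the following (quoted from memory, so treat it as a sketch):

static int __writepage(struct page *page, struct writeback_control *wbc,
                       void *data)
{
        struct address_space *mapping = data;

        /* one call per page: whatever ->writepage() does, it only ever
         * sees a single page (4KB) here */
        int ret = mapping->a_ops->writepage(page, wbc);
        mapping_set_error(mapping, ret);
        return ret;
}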

b.1) raw device mapper target
In SLE11-SP3, the block device mapping->a_ops->writepage() is fs/block_dev.c:blkdev_writepage(); its code path to submit I/O is: blkdev_writepage()==>block_write_full_page()==>block_write_full_page_endio()==>__block_write_full_page(). In __block_write_full_page(), a buffer head containing this page is finally submitted to the underlying block layer by submit_bh(). So in this case the device mapper layer only receives bios of 4KB size.
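
This is exactly where the 4KB limit comes from: submit_bh() builds a bio carrying a single buffer head, i.e. at most one page. From memory, the core of fs/buffer.c:submit_bh() in that kernel looks roughly like this (simplified sketch, error handling and flags omitted; rw and bh are submit_bh()'s own arguments):

struct bio *bio = bio_alloc(GFP_NOIO, 1);      /* room for exactly one segment */

bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;        /* the single page of this bh */
bio->bi_io_vec[0].bv_len = bh->b_size;
bio->bi_io_vec[0].bv_offset = bh_offset(bh);

bio->bi_vcnt = 1;                              /* one segment only */
bio->bi_size = bh->b_size;                     /* never larger than a page */

submit_bio(rw, bio);                           /* this 4KB bio is what dm-0 sees */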

b.2) ext3 file system on top of raw device mapper target
In SLE11-SP3, the mapping->a_ops->writepage() method of the ext3 file system is defined in three ways, corresponding to the three data journaling modes. Here I use data=ordered mode as the example. Ext3 uses jbd as its journaling infrastructure; when the journal works in data=ordered mode (the default mode in SLE11-SP3), mapping->a_ops->writepage() is fs/ext3/inode.c:ext3_ordered_writepage(). Inside this function, block_write_full_page() is called to write the page to the block layer; just as in the raw device mapper target case, submit_bh() is finally called to submit a bio carrying one page to the device mapper layer. Therefore in this case the device mapper target also only receives 4KB bios.

Finally, let's come back to my simple answer: a) this device mapper target does not merge small bios into large ones, and b) the upper layer code only issues 4KB bios to the device mapper layer.

 

February 7, 2011

alloc_sem of Ext4 block group

Filed under: File System Magic — colyli @ 11:53 am

Yesterday Amir Goldstein sent me an email about a deadlock issue. I was on Chinese New Year vacation and had no time to check the code (and I knew I could not answer his question with ease). Thanks to Ted, he provided a quite clear answer. I feel Ted's answer is also very informative to me, so I copy & paste the conversation from linux-ext4@vger.kernel.org to my blog. The copyrights of the quoted text below belong to their original authors.

On Sun, Feb 06, 2011 at 10:43:58AM +0200, Amir Goldstein wrote:
> When looking at alloc_sem, I realized that it is only needed to avoid
> race with adjacent group buddy initialization.
Actually, alloc_sem is used to protect all of the block group specific
data structures; the buddy bitmap counters, adjusting the buddy bitmap
itself, the largest free order in a block group, etc.  So even in the
case where block_size == page_size, alloc_sem is still needed!
– Ted

October 16, 2010

China Linux Storage and File System Workshop 2010

Filed under: File System Magic,Great Days,kernel — colyli @ 12:41 pm

[CLSF 2010, Oct 14~15, Intel Zizhu Campus, Shanghai, China]

Similar to the Linux Storage and File System Summit in North America, the China Linux Storage and File System Workshop is a chance for most of the active upstream I/O related kernel developers to get together and share their ideas and current status.

We (the CLSF committee) invited around 26 people to China LSF 2010, including community developers who contribute to the Linux I/O subsystem, and engineers who develop storage products/solutions based on Linux. In order to reduce travel costs for all attendees, we decided to co-locate China LSF with CLK (China Linux Kernel Developers Conference) in Shanghai.

This year, Intel OTC (Opensource Technology Center) contributed a lot to the conference organization. They kindly provided a free and comfortable conference room and assigned employees to help with organization and preparation; two intern students acted as volunteers and helped with many small tasks.

CLSF 2010 was a two-day conference. Here are some topics I found interesting and would like to share on my blog. I don't understand every topic very well, so if there is any error/mistake in this text, please let me know. Any errata are welcome 🙂

— Writeback, led by Fengguang Wu

— CFQ, Block IO Controller & Write IO Controller, led by Jianfeng Gui, Fengguang Wu

— Btrfs, led by Coly Li

— SSD & Block Layer, led by Shaohua Li

— VFS Scalability, led by Tao Ma

— Kernel Tracing, led by Zefan Li

— Kernel Testing and Benchmarking, led by Alex Shi

Besides the above topics, we also had 'From Industry' sessions; engineers from Baidu, Taobao and EMC shared their experience building storage solutions/products based on Linux.

In this blog, I’d like to share the information I got from CLSF 2010, hope it could be informative 😉

Write back

The first session was about writeback, which has been quite a hot topic recently. Fengguang has done quite a lot of work on it, and kindly volunteered to lead this session.

An idea was brought up to limit the dirty page ratio per process. Fengguang made a patch and shared a demo picture with us. When the dirty pages of a process exceed its limit, the kernel writes back that process's dirty pages smoothly, until the number of dirty pages drops below a pre-configured rate. This idea is helpful for processes holding a large number of dirty pages. Some people were concerned that this patch doesn't help the case where a lot of processes each hold a few dirty pages; Fengguang replied that for a server application, if this condition happened, the design might be buggy.

People also mentioned that the erase block size of SSDs has increased from KBs to MBs, so writing out a larger number of pages at a time may help overall file system performance. Engineers from Baidu shared their experience,

— Increasing the writeout size from 4MB to 40MB achieved a 20% performance improvement.

— Using an extent based file system gave them a more contiguous on-disk layout and less memory consumption for metadata.

Fengguang also shared his idea on how to throttle a process that dirties pages. The original idea was to control dirty pages by issuing I/O (calling writeback_inode(dirtied * 3/2)); after several rounds of improvement it became wait_for_writeback(dirtied / throttle_bandwidth). By this means, the dirty page I/O bandwidth of a process is also controlled.
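
To illustrate the second form, here is a toy sketch of the idea (my own illustration, not Fengguang's actual patch): instead of kicking writeback in proportion to the pages a task dirtied, make the dirtying task sleep long enough that its dirtying rate matches the writeback bandwidth of the backing device.

#include <linux/delay.h>

/* toy throttle: 'pages_dirtied' is what this task dirtied since the last
 * check, 'throttle_bw' is the estimated writeback bandwidth in pages/s */
static void toy_balance_dirty_pages(unsigned long pages_dirtied,
                                    unsigned long throttle_bw)
{
        /* the pause grows with the pages dirtied and shrinks as the device
         * gets faster, i.e. wait_for_writeback(dirtied / throttle_bandwidth) */
        unsigned long pause_ms = pages_dirtied * 1000UL / throttle_bw;

        msleep(pause_ms);
}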

During the discussion, Fengguang pointed out that the event of a page becoming dirty is more important than whether a page is dirty. Engineers from Baidu said that, in order to avoid a kernel/user space memory copy during file reads/writes while still using the kernel page cache, they use mmap to read and write file pages instead of calling the read/write syscalls. In this case a writable mmap'ed page is initially mapped read-only; when a write happens, a page fault is triggered, and then the kernel knows the page got dirty.
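
A minimal user space sketch of this mmap based read/write path (my own example; the file name is hypothetical): the file's pages are mapped shared, so a plain store dirties the page in the kernel page cache without an extra copy through read()/write(), and the first store is what faults into the kernel and marks the page dirty.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        int fd = open("datafile", O_RDWR | O_CREAT, 0644);  /* hypothetical file */

        if (fd < 0)
                return 1;
        if (ftruncate(fd, len) < 0)      /* make sure at least one page exists */
                return 1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        memcpy(p, "hello", 5);   /* first store faults, kernel marks the page dirty */
        msync(p, len, MS_SYNC);  /* flush the dirtied page back to the file */

        munmap(p, len);
        close(fd);
        return 0;
}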

It seems many ideas are being worked on to improve writeback performance, including proactive writeback in the background and some cooperation with the underlying block layer. My current focus is not here; anyway, I believe the people in the room can help out a bit 🙂

Btrfs

Recently, many developers in China have started to work on btrfs, e.g. Miao Xie, Zefan Li, Shaohua Li, Zheng Yan, … Therefore we specially arranged a two-hour session for btrfs. The main purpose of the btrfs session was to share what we are doing on btrfs.

Most people agreed that btrfs needs a real fsck tool now. Engineers from Fujitsu said they had a plan to invest people in developing a btrfs checking tool. Miao Xie, Zefan Li, Coly Li and other developers suggested considering the pain points of fsck from the beginning,

— memory consumption

Nowadays 10TB+ storage media are cheap and common. For a large file system built on them, fsck needs a lot of memory to hold metadata (e.g. bitmaps, dir blocks, inode blocks, btree internal blocks …). For online fsck, consuming too much memory during the check has a negative performance impact on the page cache and other applications. For offline fsck this was not a problem, but now that online fsck is coming, we have to face this open question 🙂

— fsck speed

A tree structured file system has (much) more metadata than a table structured file system (like Ext2/3/4), which may mean more I/O and more time. For a 10TB+, 80% full file system, reducing the checking time will be a key issue, especially for online service workloads. I proposed a solution: allocate metadata on an SSD or another device with faster seeks, so that checking the metadata incurs no (or little) seek time, which results in a faster file system check.

Weeks before, two intern students, Kunshan Wang and Shaoyan Wang, who worked with me, wrote a very basic patch set (including kernel and user space code) to allocate metadata from a device with a shorter seek time. The patch set compiles, and the students did a quite basic verification of the metadata allocation; the patch works. I haven't reviewed the patch yet; from a rough look at the code, a lot of improvement is needed. I posted this draft patch set to the China LSF mailing list to call for more comments from CLSF attendees. I hope I can find time next month to improve the great job done by Kunshan and Shaoyan.

Zefan Li said there was a todo list for btrfs: a long term task was data de-duplication, and a short term task was allocating data from SSD. Herbert Xu pointed out that the underlying storage media impacts file system performance quite a lot; according to a benchmark by Ric Wheeler of Redhat, on a high end Fusion-io PCI-E SSD there is almost no performance difference between well known file systems like xfs, ext2/3/4 or btrfs.

People also said that these days the review and merging of btrfs patches is often delayed; it seems the btrfs maintainer is too busy to handle community patches. The maintainer replied that the situation would improve and patches would be handled in time, but there has been no obvious improvement so far. I can understand that when a person has more urgent tasks like kernel tree maintenance, he or she has difficulty handling non-trivial patches in time if that is not his or her highest priority job. From CLSF, I see more and more Chinese developers starting to work on btrfs; I hope they will be patient if their patches don't get handled in time 🙂

Engineers from Intel OTC mentioned there is no btrfs support in popular boot loaders like Grub2. IIRC someone is working on it and the patches are almost ready. Shaohua asked why not load the Linux kernel with a Linux kernel, like the kboot project does; people pointed out there still has to be something to load the first Linux kernel, which is a chicken-and-egg problem 🙂 My point was that it should not be very hard to enable btrfs support in a boot loader; a small Google Summer of Code project could make it. I'd like to port and merge the patches (if they are available) into openSUSE, since I maintain the openSUSE grub2 package.

Shaohua Li shared his experience with btrfs development for the Meego project; he did some work on fast boot and readahead for btrfs. Shaohua said some performance gains were observed on btrfs, and the better results were achieved with some hacking, like a big readahead size, a dedicated work queue to handle write requests, and a big writeback size. Fengguang Wu and Tao Ma pointed out this might be a generic hack, because Ext4 and OCFS2 also did similar hacking for better performance.

Finally Shaohua Li pointed out there is a huge opportunity to improve the scalability of btrfs, since there is still a lot of global locking and cache missing in the current code.

SSD & Block Layer

This was a quite interesting session led by Shaohua Li. Shaohua started the session with some observed problems between SSDs and the block layer,

— Throughput is high, like network

— Disk controller gaps, e.g. no MSI-X …

— Big locks, queue lock, scsi host lock, …

Shaohua shared some benchmark results showing that at high IOPS the interrupt load piles up on a single CPU; even on a multi-processor system the interrupts could not be balanced across processors, which is a bottleneck for handling the interrupts generated by SSD I/O. If a system had 4 SSDs, one processor ran at 100% handling interrupts while throughput was only around 60%-80%.

A workaround here was polling. Replacing interrupts with blk_iopoll could help the performance numbers, since it reduces the processor overhead of interrupt handling. However, Herbert Xu pointed out that the key issue was that current hardware didn't support multiple queues for the same interrupt. Different interrupts could be balanced across the processors in the system, but unlike network hardware, the same interrupt could not be spread over multiple queues and could only be handled by a single processor. Hardware multi-queue support would be the silver bullet.

For SSDs like the ones Fusion-io produces, IOPS can reach one million+ on a single device; the parallel load is much higher than on traditional hard disks. Herbert, Zefan and I agreed that some hidden race defects should be observed very soon.

Right now, the block layer is not ready for such a highly parallel I/O load. Herbert Xu pointed out that lock contention might be a big issue to solve. The source of the lock contention is the cache coherence cost of global resources protected by locks. Converting global resources into per-CPU local data might be a direction for solving the lock contention issue. Since Jens and Nick can get access to Fusion-io devices more conveniently, we believe they can work with other developers to help out a lot.

Kernel Tracing

Zefan Li helped to lead an interesting session about kernel tracing. I don't have any real understanding of any kernel trace infrastructure; for me the only tool is printk(). IMHO printk is the best trace/debug tool for kernel programming. Anyway, debugging is always an attractive topic for a curious programmer, and I felt Zefan did his job quite well 🙂

Tao Ma, an OCFS2 developer, mentioned that OCFS2 currently uses printk-wrapper trace code, which is not flexible and quite obsolete; OCFS2 developers are thinking of moving to a trace infrastructure like ftrace.

Zefan pointed out that using ftrace to replace the previous printk based trace messages requires care: there may be ABI (application binary interface) issues for user space tools. Some user space tools work with kernel messages (one can check kernel messages with the dmesg command). An Intel engineer mentioned there was an accident recently where a kernel message modification caused the powertop tool to stop working correctly.

For file system tracing, the situation may be easier. Because most of the trace info is used by file system developers or testers, whoever adds trace info into file system code might happily ignore the ABI issue. Anyway, that is just "might", not "is allowed to".

Zefan said there was a patch introducing TRACE_EVENT_ABI; if some trace info can form a stable user space ABI, it can be declared with TRACE_EVENT_ABI.

This session also discussed how ftrace works. Now I know the trace info is stored in a ring buffer. If ftrace is enabled but the ring buffer is not, the user still cannot receive trace info. People also said that a user space trace tool would be necessary.

Someone said the perf tool is getting more and more powerful, and it is probable that the trace function will be integrated into perf. The Linux kernel only needs one trace tool, and some people in this workshop think it might be perf (I have no opinion, because I use neither).

Finally Herbert again suggested people pay attention to scalability issues when adding trace points. Currently the ring buffer is not a per-CPU local area, so adding trace points might introduce performance regressions into already optimized code.

From Industry

At last year's BeijingLSF, we invited two engineers from Lenovo. They shared their experience using Linux as the base system for their storage solution. This session got quite positive feedback, and all committee members suggested continuing the From Industry sessions this year.

For ChinaLSF 2010, we invited 3 companies to share their ideas with other attendees. Engineers from Baidu, Taobao and EMC led three interesting sessions; people had the chance to learn what kinds of difficulties they encountered, how they solved the problems, and what they achieved from their solutions or workarounds. Here I share some interesting points on my blog.

From Taobao

Engineers from Taobao shared their work based on Linux storage and file systems; the projects were Tair and TFS.

Tair is a distributed cache system used inside Taobao; TFS is a distributed user space file system storing the pictures of Taobao's goods. For detailed information, please check http://code.taobao.org 🙂

From EMC

Engineers from EMC shared their work on file system recovery, especially file system checking. Tao Ma and I also mentioned what we did in fsck.ocfs2 (the ocfs2 file system checking tool). The opinion from EMC was that even if an online file system check is possible, the offline fsck is still required, because an offline file system check can examine and fix a file system from a higher level scope.

Other points, including memory consumption and time cost, were also discussed in the previous sessions.

From Baidu

This was the first time I met people from Baidu and had a chance to learn what they do on the Linux kernel. Thanks to the Baidu kernel team, we had the opportunity to learn what they have done in the past years.

Guangjun Xie from Baidu started the session by introducing Baidu's I/O workload: most of the I/O is related to indexing and distributed computing, and read performance is desired more than write performance. In order to reduce memory copying during data reads, they use mmap to read data pages from the underlying media into the page cache. Pages accessed via mmap cannot take advantage of the Linux kernel's page cache replacement algorithm, and Baidu didn't want to implement a similar page cache in user space. Therefore they used a not-beautiful-but-efficient workaround: they implemented an in-house system call which touches the pages (returned by mmap) in the kernel's page LRU. By this means, the data pages can be managed by the kernel's page cache code. Some people pointed out this was mmap() + readahead. According to Baidu's benchmark, this effort increased search workload performance on a single node server by 100%.

Baidu also tried a bigger block size on the Ext2 file system to make the data block layout more contiguous; their performance data showed that the bigger block size also resulted in better I/O performance. IMHO, a local mode ocfs2 file system may achieve similar performance, because the basic allocation unit of ocfs2 is a cluster, and the cluster size can be from 4KB to 1MB.

Baidu also tried to compress/decompress the data when writing to and reading from disk; since most of Baidu's data is text, the compression ratio is satisfyingly high. They even used a PCIe compression card, and the performance result was pretty good.

Guangjun also mentioned that when they used SATA disks, some I/O errors were silent errors. For metadata this is fatal, so at least a metadata checksum is necessary. For data checksums, they do it at the application level.

Conclusion

Now we come to the last part of this blog; let me give my own conclusion to ChinaLSF 2010 🙂

IMHO, the organization and preparation this year was much better than at BeijingLSF 2009. People from Intel Shanghai OTC contributed a lot of time and effort before/during/after the workshop; without their effort, we could not have had such a successful event. A big thank you should also go to our sponsor EMC China; they not only sponsored the conference expenses, but also sent engineers to share their development experience.

Let's wait for ChinaLSF 2011 next year 🙂

July 25, 2010

Don’t waste your SSD blocks

Filed under: File System Magic — colyli @ 11:20 pm

These days, one of my colleagues asked me a question. He formatted an ~80G Ext3 file system on an SSD. After mounting the file system, the df output was,

Filesystem      1K-blocks     Used  Available  Use%  Mounted on
/dev/sdb1        77418272   184216   73301344    1%  /mnt

The fdisk output said,

Device Boot      Start      End     Blocks  Id  System
/dev/sdb1         7834    17625   78654240  83  Linux

From his observation, before formatting the SSD there were 78654240 1K blocks available on the partition; after the format, 77418272 1K blocks could be used, which means more than 1GB of the partition is already unusable.

A more serious question: from the df output, used blocks + available blocks = 73485560, but the file system has 77418272 blocks — 3932712 1K blocks disappeared! This 160G SSD cost him 430 USD, and he complained that around 10 USD was paid for nothing.

IMHO, this is a quite interesting question, and it has been asked by many people many times. This time, I'd like to spend some time explaining where the blocks go, and how to make better use of every block on the SSD (since it's quite expensive).

First of all, better storage usage depends on the I/O pattern in practice. This SSD is used to store large files for random I/O; most of the I/O (99%+) is reads at random file offsets, and the writes can almost be ignored. Therefore the goal is to use every available block to store a few very big files on the Ext3 file system.

If the Ext3 file system is formatted with just the default command line, like "mkfs.ext3 /dev/sdb1", mkfs.ext3 will do the following block allocations,

– Allocates reserved blocks for the root user, to keep non-privileged users from using up all disk space.

– Allocates metadata such as the superblock, backup superblocks, block group descriptors, and a block bitmap, inode bitmap and inode table for each block group.

– Allocates reserved block group descriptor blocks for future file system extension.

– Allocates blocks for the journal.

Since the SSD is only for data storage (no operating system is installed on it), write performance is disregarded here, there is no requirement for future file system extension, and only a few files are stored on the file system, some of these block allocations are unnecessary and useless,

– Journal blocks

– Inodes blocks

– Reserved group descriptor blocks for file system resize

– Reserved blocks for root user

Let's run dumpe2fs to see how many blocks are spent on the above items. I only list the relevant part of the output here,

> dumpe2fs /dev/sdb1

Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          f335ba18-70cc-43f9-bdc8-ed0a8a1a5ad3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              4923392
Block count:              19663560
Reserved block count:     983178
Free blocks:              19308514
Free inodes:              4923381
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1019
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512

Filesystem created:       Tue Jul  6 21:42:32 2010
Last mount time:          Tue Jul  6 21:44:42 2010
Last write time:          Tue Jul  6 21:44:42 2010
Mount count:              1
Maximum mount count:      39
Last checked:             Tue Jul  6 21:42:32 2010
Check interval:           15552000 (6 months)
Next check after:         Sun Jan  2 21:42:32 2011
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      3ef6ca72-c800-4c44-8c77-532a21bcad5a
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00000001
Journal start:            0

Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-5
Reserved GDT blocks at 6-1024
Block bitmap at 1025 (+1025), Inode bitmap at 1026 (+1026)
Inode table at 1027-1538 (+1027)
31223 free blocks, 8181 free inodes, 2 directories
Free blocks: 1545-32767
Free inodes: 12-8192

[snip ….]

The file system block size is 4KB, which is different from the 1K block size used in the df and fdisk output. Now let's look at the line for reserved blocks,

Reserved block count:     983178

These 983178 4KB blocks are reserved for the root user; since neither the system nor the user home directories are on the SSD, we don't need to reserve these blocks. See mkfs.ext3(8): the '-m' parameter sets the reserved-blocks-percentage, and '-m 0' reserves zero blocks for the privileged user.

From the file system features line, we can see resize_inode is one of the features enabled by default,

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

The resize_inode feature reserves quite a lot of blocks for future block group descriptors; these blocks can be found in lines like,

Reserved GDT blocks at 6-1024

When the resize_inode feature is enabled, mkfs.ext3 reserves some blocks right after the block group descriptor blocks, called "Reserved GDT blocks". If the file system is extended in the future (e.g. when the file system is created on a logical volume), these reserved blocks can be used for new block group descriptors. Here the storage media is an SSD and no file system extension is planned, so we don't have to pay money (on an SSD, blocks mean money) for this kind of blocks. To disable the resize_inode feature, use "-O ^resize_inode" with mkfs.ext3(8).

Then look at these 2 lines for inode blocks,

Inodes per group:         8192
Inode blocks per group:   512

We will store no more than 5 files on the whole file system, but here 512 blocks in each block group are allocated for the inode table. There are 601 block groups, which means 512×601 = 307712 blocks (≈ 1.2GB of space) are wasted on inode tables. Use '-N 16' with mkfs.ext3(8) to ask for only 16 inodes in the file system; although mkfs.ext3 still allocates at least one inode table block per block group (which holds more than 16 inodes), we now waste only 1 block instead of 512 blocks per group on the inode table.

Journal size:             128M

If most of the I/O is reads, write performance is ignored, and people really care about space usage, the journal can be reduced to the minimum size (1024 file system blocks); for a 4KB-block Ext3 that is 4MB: -J size=4M
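
A rough breakdown of the savings (my own back-of-the-envelope numbers, based on the dumpe2fs output above):

root-reserved blocks:  983178 × 4KB ≈ 3.75 GB
inode tables:          511 saved blocks/group × 601 groups × 4KB ≈ 1.2 GB
journal:               128 MB shrunk to 4 MB, saving 124 MB
reserved GDT blocks:   1019 × 4KB in every group that carries a superblock backup

Adding these up gives well over 4GB, which is where the "4GB+" below comes from.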

With the above efforts, around 4GB+ of space comes back into use. If you really care about the space usage efficiency of your SSD, how about making the file system with:

mkfs.ext3 -J size=4M -m 0 -O ^resize_inode -N 16  <device>

Then you have a chance to put more data blocks to use on your expensive SSD 🙂

June 27, 2010

Random I/O — Is raw device always faster than file system ?

Filed under: File System Magic — colyli @ 8:53 am

For some implementations of distributed file systems, like TFS [1], the developers think that storing data directly on a raw device (e.g. /dev/sdb, /dev/sdc…) might be faster than on a file system.

Their choice is reasonable,

1, Random I/O on a large file cannot get much help from the file system page cache.

2, The <logical offset, physical offset> mapping introduces more I/O on a file system than on a raw disk.

3, Managing metadata on other powerful servers avoids the need to use a file system on the data nodes.

The penalty for the "higher" performance is management cost; storing data on a raw device introduces difficulties like,

1, It is harder to back up/restore the data.

2, No flexible management is possible without special management tools for the raw device.

3, There is no convenient method to access/manage the data on the raw device.

These penalties are hard for system administrators to ignore. Furthermore, the story of "higher" performance is not exactly true today,

1, For file systems using block pointers for the <logical offset, physical offset> mapping, a large file takes too many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K+ pointer blocks (see the rough arithmetic after this list). Most of the pointer blocks are cold during random I/O, which results in lower random I/O performance than on a raw device.

2, For file systems using extents for the <logical offset, physical offset> mapping, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4 with a maximum block group size of 128MB, a 2TB file has around 16384 fragments. Mapping these 16K fragments needs 16K extent records, which fit in 50+ extent blocks (again, see the arithmetic below). It's very easy to hit a hot in-memory extent during random I/O on a large file.

3, If the <logical offset, physical offset> mapping can be kept hot in memory, random I/O performance on a file system might be no worse than on a raw device.
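
Here is the rough arithmetic behind the two numbers above (my own back-of-the-envelope figures, assuming 4-byte block pointers on Ext3 and 12-byte extent records on Ext4):

Ext3, 4KB blocks:
    2TB / 4KB          ≈ 536,870,912 data blocks to map
    4KB / 4 bytes      = 1024 pointers per indirect block
    536,870,912 / 1024 ≈ 524,288 indirect blocks, i.e. the 520K+ pointer
                         blocks above (about 2GB of mapping metadata)

Ext4, roughly one extent per 128MB block group:
    2TB / 128MB        = 16,384 extents
    ~340 extent records fit in one 4KB extent block
    16,384 / 340       ≈ 48 extent blocks, i.e. the 50+ blocks above
                         (only about 200KB of mapping metadata)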

In order to verify my guess, I did some performance testing. I share part of the data here.

Processor: AMD opteron 6174 (2.2 GHz) x 2

Memory: DDR3 1333MHz 4GB x 4

Hard disk: 5400RPM SATA 2TB x 3 [2]

File size: (created by dd) almost 2TB

Random I/O access: 100K times read

IO size: 512 bytes

File systems: Ext3, Ext4 (with and without directio)

test tool: seekrw [3]

* With page cache

– Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r

– Performance result


FS      Device     tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
Ext3    sdc      95.88      767.07           0     46024         0
Ext4    sdd      60.72      485.60           0     29136         0

– Wall clock time

Ext3: real time: 34 minutes 23 seconds 557537 usec

Ext4: real time: 24 minutes 44 seconds 10118 usec

* directio (without pagecache)

– Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d

– Performance result


FS      Device     tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
Ext3    sdc      94.93      415.77           0     12473         0
Ext4    sdd      67.90       67.90           0      2037         0
Raw     sdf      67.27      538.13           0     16144         0

– Wall clock time

Ext3: real time: 33 minutes 26 seconds 947875 usec

Ext4: real time: 24 minutes 25 seconds 545536 usec

sdf: real time: 24 minutes 38 seconds 523379 usec    (raw device)

From the above performance numbers, Ext4 is 39% faster than Ext3 for random I/O, with or without page cache; this is expected.

The random I/O results on Ext4 and on the raw device are almost the same. This is also expected. For file systems mapping <logical offset, physical offset> with extents, it's quite easy to keep most of the mapping records hot in memory. Random I/O on the raw device has *NO* obvious performance advantage over Ext4.

Dear developers, how about considering an extent based file system now 🙂

[1] TFS, TaobaoFS. A distributed file system deployed for http://www.taobao.com . It is developed by the core system team of Taobao and will be open sourced very soon.

[2] The hard disks are connected to the system via a RocketRAID 644 card through eSATA connectors.

[3] The seekrw source code can be downloaded from http://www.mlxos.org/misc/seekrw.c

April 30, 2010

a conversation on DLM lock levels used in OCFS2

Filed under: Basic Knowledge,File System Magic — Tags: — colyli @ 9:56 am

Recently, I had a conversation with Mark Fasheh; the topic was the DLM (Distributed Lock Manager) lock levels used in OCFS2 (Oracle Cluster File System v2). IMHO, the talk is quite useful for a starter on OCFS2 or DLM, so I list the conversation here and hope it is informative. Thank you, Mark 😉

Mark gave a simplified explanation of the NL, PR and EX dlm lock levels used in OCFS2.

There are 3 lock levels Ocfs2 uses when protecting shared resources.

“NL” aka “No Lock” this is used as a placeholder. Either we get it so that we
can convert the lock to something useful, or we already had some higher level
lock and dropped to NL so another node can continue. This lock level does not
block any other nodes from access to the resource.

“PR” aka “Protected Read”. This is used so that multiple nodes might read the
resource at the same time without any mutual exclusion. This level blocks only
those nodes which want to make changes to the resource (EX locks).

“EX” aka “Exclusive”. This is used to keep other nodes from reading or changing
a resource while it is being changed by the current node. This level blocks PR
locks and other EX locks.

When another node wants a level of access to a resource which the current node
is blocking due to its lock level, that node “downconverts” the lock to a
compatible level. Sometimes we might have multiple nodes trying to gain
exclusive access to a resource at the same time (say two nodes want to go from
PR -> EX). When that happens, only one node can win and the others are sent
signals to ‘cancel’ their lock request and if need be, ‘downconvert’ to a mode
which is compatible with what’s being requested. In the previous example, that
means one of the nodes would cancel its attempt to go from PR->EX and
afterwards it would drop its PR to NL, since the PR lock blocks the other node
from an EX.
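
To summarize the three levels, here is a tiny sketch of the compatibility rules in my own words (my own illustration, not code from dlmglue.c or fs/dlm):

/* compat[holding][requested] is non-zero when a node holding 'holding'
 * does not block another node requesting 'requested' on the same resource */
enum lock_level { NL, PR, EX };

static const int compat[3][3] = {
        /* requested:     NL  PR  EX */
        /* holding NL */ { 1,  1,  1 },   /* NL blocks nothing */
        /* holding PR */ { 1,  1,  0 },   /* PR blocks only EX */
        /* holding EX */ { 1,  0,  0 },   /* EX blocks PR and EX */
};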

After reading the above text, I talked with Mark on IRC. Here is the edited (unnecessary parts removed) conversation log,

coly: it's an excellent material for DLM lock levels of ocfs2!
mark: specially if that helps folks understand what’s happening in dlmglue.c
* mark knows that code can be…. hard to follow  😉
mark: another thing you might want to take note of – this whole “cancel convert” business is there because the dlm allows a process to retain it’s current lock level while asking for an escalation
coly: one thing I am not clear is, what’s the functionality of dlmglue.c ? like the name, glue ?
mark: if you think about it – being forced to drop the lock and re-acquire would eliminate the possibility of deadlock, at the expense of performance
mark: think of dlmglue.c as the layer of code which abstracts away the dlm interface for the fs
mark: as part of that abstraction, file system lock management is wholly contained within dlmglue.c
coly: only dlmglue.c acts as a abstract layer ?  and the real job is done by fsdlm or o2dlm ?
mark: yes
mark: dlmglue is never actually creating resources itself – it’s asking the dlm on behalf of the file system
mark: aside from code cleanliness, dlmglue provides a number of features the fs needs that the dlm (rightfully) does not provide
coly: which kind of ?
mark: lock caching for example – you’ll notice that we keep counts on the locks in dlmglue
mark: also, whatever fs specific actions might be needed as part of a lock transition are initiated from dlmglue. an example of that would be checkpointing inode changes before allowing other nodes access, etc
coly: yeah, that’s one more thing confusing me.
coly:  It’s not clear to me yet, for the conception of upconvert and downconvert
coly: when it combined with ast and bast
mark: have you checked out the “dlmbook” pdf? it explains the dlm api (which once you understand, makes dlmglue a lot easier to figure out)
coly: yes, I read it. but because I didn’t know ast and bast before, I don’t have conception on what happens in ast and bast
coly: is it something like the signal handler ?
mark: ast and bast though are just callbacks we pass to the dlm. one (ast) is used to tell fs that a request is complete, the other (bast) is used to tell fs that a lock is blocking progress from another node
coly: when an ast is triggered, what will happen ? the node received the ast can make sure the requested lock level is granted ?
mark: generally yes. the procedure is: dlmglue fires off a request… some time later, the ast callback is run and the status it passes to dlmglue indicates whether the operation succeeded
coly: if a node receives a bast, what will happen ? I mean, are there options (e.g. release its lock, or ignore the bast) ?
mark: release the lock once possible
mark: that’s the only action that doesn’t lockup the cluster  😉
coly: I see, once a node receives a bast, it should try its best to downconvert the corresponding lock to NL.
coly: it’s a little bit clear to me 🙂

I recite the log rather than giving my own understanding; it can be helpful for getting the basic concepts of OCFS2's dlm levels and what ast and bast do.

December 11, 2009

OCFS2 Truncate Log

Filed under: File System Magic — colyli @ 2:17 am

Here are some lines I copied from the "OCFS2 Support Guide – On-Disk Format".

Truncate Log

Truncate logs help to improve delete performance. This system file allows the fs to collect freed bits and flush them to the global bitmap in chunks. What that also means is that space could be temporarily "lost" from the fs. That is, the space freed by deleting a large file may not show up immediately. One can view the orphan_dirs and the truncate_logs to account for such "lost" space.

To view a truncate log, do:

truncate_log_output

The truncate_log keeps records of the start cluster# and the number of clusters. The maximum number of such records ("Total Records") depends on the block size; the output above is for a 4KB block size.

October 27, 2009

BeijingLSF 2009

Filed under: File System Magic,Great Days,kernel — colyli @ 12:31 pm

In the past several months, I was helping to organize BeijingLSF (Beijing Linux Storage and File System Workshop) with other kernel developers in China. This great event happened on Oct 24; here is my report.

Since early this year, budget control has happened in almost all companies/organizations, so many local kernel developers in China could not attend LSF in the United States (it's quite interesting that most kernel developers in China are storage related: mm, cgroup, io controller, fs, device mapper …). Under such conditions, we found there were enough people inside China to sit together and discuss storage and file system related topics. In April 2009, a proposal for BeijingLSF was posted on linux-kernel@zh-kernel.org. Many people provided positive feedback, and we started to organize this event.

A committee of 7 persons was set up first, with people from Novell, Intel, Oracle, Redhat and Freescale. The committee made an invitation list of 20 persons. The website was built at http://www.linuxevents.cn/beijinglsf and all invitees registered. Fortunately we got sponsorship from Novell China for soft drinks and snacks.

There were 6 sessions at BeijingLSF 2009. There were no talks; people just sat together to discuss specific topics.

The first session was on the distributed lock manager. I led the session; the discussion included,

– an introduction to the background of dlm and the current issues in fs/dlm (most of the issues were from my closed-or-open and open-to-community BNCs).

– Oracle ocfs2 developers explained why ocfs2 uses a 64-byte lock value block.

– Jiaju Zhang (Novell) explained his patches for dlm performance improvement.

– Tao Ma (Oracle), Jiaju Zhang (Novell), Xinwei Hu (Novell) and I discussed how dlm works with ocfs2's user mode cluster stack.

The second session was on clustering file systems, led by Tao Ma (Oracle). Tao suggested people introduce each other before the session. During the introductions, discussions happened as people described their current projects. When the discussion finished, 40 minutes had passed. The workshop had no introduction time planned, so most of this session was used for people to get to know each other; IMHO it was worth it. This was the first time almost all storage and file system developers in China sat together, and people came to know the faces behind the email addresses.

The last session in the morning was on shared storage and snapshots, led by Xinwei Hu (Novell). Xinwei introduced how logical volume management works in a clustering environment, then discussion happened on,

– Considering that snapshots are starting to happen at the file system level, snapshots by device mapper might become less and less important in the future.

– Is it possible to support snapshots by lvm in a clustering environment, and is it worth it? There was no conclusion from the discussion, and I'd like to hear from device mapper developers 🙂

After lunch, the first session of the afternoon was VFS readahead and writeback. The session was led by Fengguang Wu (Intel); a 6 page slide deck kept people discussing for 90 minutes. Wu spent 20 minutes introducing his patch, then people talked about,

– Why anonymous pages and file pages should be treated differently.

– In order to improve writeback performance, MM should be able to suspend (maybe there is a better term) a process making too many dirty pages.

– On some kinds of SSD, linear read/write was slower than discrete read/write. If the storage media is an SSD, there might be little difference in the writeback policy.

The second session of the afternoon was I/O controller and I/O bandwidth, led by Jianfeng Gui (Fujitsu). This talk was quite new to most of the attendees. Jianfeng explained the concept of the io controller very well; at least I understood it is a software concept, not a hardware one 🙂 The io controller is an interesting idea, but most of the concern in the workshop was focused on its complexity.

The last session of the workshop was an interaction with industry. We invited an engineer from Lenovo, Yilei Lu, who was working on an internet storage solution based on the Linux operating system. Yilei introduced how they use Linux as the base system in their storage solution, and what problems or difficulties they encountered. Many people provided suggestions for the questions, and most of the developers were very happy to hear feedback from such development users.

After all six sessions, there were lightning talks. Almost all attendees said this workshop was the first effort to get the active upstream developers in China to sit together. Some people expressed their willingness to sponsor BeijingLSF next year (if there is one), and some said they could help to organize similar events in their cities. IMHO, BeijingLSF was a great and successful event. The most important thing is not even the discussion: this is the *first* time for *ALL* (yes, ALL) of the most active storage related developers within China to see each other and have a chance to talk face to face. Unlike LSF in the United States, BeijingLSF has little effect on Linux storage and file system development; anyway, it was a great effort to make discussion and face-to-face communication happen.

Novell played a very important role and contributed quite a lot to BeijingLSF. I was able to use the ITO (what a great idea!) to help organize BeijingLSF, and Novell China sponsored soft drinks and snacks to make all attendees more comfortable while talking the whole day.

Finally please permit me to thank all attendees, they are,

Bo Yang, Coly Li, Fengguang Wu, Herbert Xu, Jeff He, Jeff Liu, Jiaju Zhang, Jianfeng Gui, Michael Fu, Tao Ma, Tiger Yang, Xiao Luo, Xinwei Hu, Xu Wang, Yang Li, Yawei Niu, Yilei Lu, Yu Zhong, Zefan Li, Zheng Yan.

Your coming and participating make BeijingLSF being a great and successful event.

Beijing Linux Storage and File System Workshop 2009

[If you are interested on how the attendees look alike, please check http://picasaweb.google.com/colyli/BeijingLSF2009]

September 10, 2009

add extra mount options when mounting an ocfs2 volume by RA

Filed under: File System Magic — colyli @ 10:16 am

Just today, I learned how to add extra mount options (like journal mode, or acl, or …) when mounting an ocfs2 volume via the pacemaker resource agent.

Type 'crm ra info Filesystem ocf'; here is the output:

Filesystem resource agent (ocf:heartbeat:Filesystem)

Resource script for Filesystem. It manages a Filesystem on a shared storage medium.

Parameters (* denotes required, [] the default):

device* (string): block device
The name of block device for the filesystem, or -U, -L options for mount, or NFS mount specification.

directory* (string): mount point
The mount point for the filesystem.

fstype* (string): filesystem type
The optional type of filesystem to be mounted.

options (string)
Any extra options to be given as -o options to mount.

For bind mounts, add “bind” here and set fstype to “none”.
We will do the right thing for options such as “bind,ro”.

Operations’ defaults (advisory minimum):

start         timeout=60
stop          timeout=60
notify        timeout=60
monitor_0     interval=20 timeout=40 start-delay=0

Now I know that adding an options="…" parameter to the Filesystem primitive in the cib configuration does the trick (for example, options="acl"). Thanks to Dejan Muhamedagic 🙂

August 4, 2009

Report for the Course of Dragon Star Programming

Filed under: Basic Knowledge,File System Magic,Great Days — colyli @ 12:42 pm

From July 13 to 17, I took a course in the 'Dragon Star' program. Here is my report on this great event.

At the very beginning, please permit me to thank my Labs colleagues Nikanth Karthikesan and Suresh Jayaraman for their kind help. They reviewed the draft version of this report during their busy working hours and provided many valuable comments, which made the final version's quality much better.

Dragon Star is a series of computer science academic exchanges. It is sponsored by the China Natural Science Fund to invite outstanding overseas Chinese professors to give systematic training to post graduate students. The Dragon Star program office is located at the Institute of Computing Technology of the Chinese Academy of Sciences.

In the past several years, most attendees of this program have been university students and researchers from state owned institutes or organizations. Due to the limited seats for each course, this program seldom accepts applications from multi-national companies (like Novell). This year, it was quite surprising that my application was approved, so I had a chance to meet people from state owned institutes/organizations and exchange ideas with the professor and students.

The course I applied for was "Infrastructure of data-binded applications", taught by Professor Xiaodong Zhang from Ohio State University. Frankly speaking, it was more a seminar than a course: open discussions happened freely, and the professor was willing to listen and discuss with students. Though it was a course for post graduate students, most of the audience were PhD students, researchers and professors from universities. The venue was the Changsha Institute of Technology; Changsha is another city in China, famous for its crazily high temperature in summer. I was surprised to learn that many students of this university stayed at school during the summer holiday and also joined the training.

In the 4.5 day course, Prof Zhang touched quite a lot of layers of the whole computer storage stack, from the CPU L1/2/3 caches, to main memory, swapping in virtual memory and the buffer cache for disk data, and finally SSDs. Though many fields were mentioned, the discussion was always focused on one topic: caching. As a file systems engineer, caching is something I should know and must know. At the end of the course, I did feel that the content of this course was much more than I had expected.

Here I share some of my experiences from this course.

– Most of my computer knowledge was learned by myself, and for CPU cache organization I had a misunderstanding of fully associative caches and direct mapped caches for a long time. I thought a fully associative cache was much faster than a direct mapped cache, because it can compare the cached values in parallel, but a fully associative cache is too expensive, therefore the direct mapped cache was introduced. From this training, I realized a direct mapped cache is much faster than a fully associative cache. The shortcoming of a direct mapped cache is cache conflict handling, which is just the advantage of a fully associative cache. Even a 2-way set-associative cache can improve a lot on cache conflicts because it can cache two different values in its 2 ways. Then I learned a new (to me) idea to improve cache lookup speed in set-associative caches [1].

– In this training, I also re-learned the concept of the row buffer in the memory controller. The professor told us they observed an interesting condition: because the row buffer only caches one page-sized chunk of data, when there is a cache conflict in the L2/3 cache, there will also be a conflict in the row buffer. The conflict means new content has to replace the old content in the same place. Then he introduced their work on how to solve this issue [2]. The solution is quite simple, but how they observed this issue and how they made the analysis is a perfect story.

– As a Linux kernel developer, 2 talks by the professor helped me understand Linux virtual memory management a little more easily. One was a method to improve the page replacement policy when cache memory is full, called the token-ordered LRU, which minimizes the possibility of page thrashing [3]. The other was to avoid caching too many accessed-once-only pages in cache memory, which is called Clock-Pro [4]. I had known the token-ordered LRU for days, but this was the first time I met one of the algorithm's authors. For Clock-Pro, from google I know there are patches which are not upstream yet. If you are interested in these topics, please check the references.

– Prof Zhang's team also did research on SSD (Solid State Disk) caching. They ran different I/O patterns and got some interesting performance numbers. One of the numbers that impressed me was that random reads and sequential reads have a recognizable performance difference on an Intel X25-E SSD. In the past 2 years, I took it for granted that all reads from an SSD have the same speed. The explanation here was that there is a readahead buffer inside the SSD, therefore sequential reads are faster than random reads. Caching is everywhere!

When I stayed in Changsha, the temperature was 40+ degrees Celsius. Every day, I walked 6km between the dorm and the classroom (a big university) in the sunshine and high temperature; my T-shirt was wet all the time. In the evening, it was a good time to read related papers and review the topics Prof Zhang mentioned in the daytime. I had the feeling that if I had learned these topics by myself, I would have spent 3-6 more months.

The course was divided into several talks; here I list all the talks in the order they were given. For the details of each talk, if you are interested, please check the references, where you can find the papers and slide files. The only pity is that the slides are made with MS Office and might not be perfectly compatible with OpenOffice.

Day 1
The first half of the day was an overview; the topic was "Balancing System Resource Supply and Demand for Effective Computing".
In the second half of the day we discussed processor cache design principles, including,
– the basic logical structure of CPU caches
– the trade-off between hit rate and access delay
– cache design for a high hit rate with low access delay
Today, the new term for me was the multi-column cache. Since most of the architecture books (that I have read) stop at the N-way set-associative cache, this was the first time I heard of the multi-column cache.

Day 2
The first half of the day continued the previous day's topic with cache management in multi-core systems. Prof Zhang introduced the page coloring scheme: assign different colors to pages mapping to different cache pages, and the same color to pages mapping to the same cache page. Pages with the same color are grouped into a bucket. If a task is cache hungry at run time, the kernel can allocate more buckets for it. Though I didn't understand how the buckets cope with DMA (I guess for a physically addressed cache, DMA transfers might invalidate some content in the L1/2 cache), the benchmark numbers show quite some improvement when some applications are assigned more page buckets. Details can be found in [4]. Continuing this idea, another work was introduced: a database prototype called MCC-DB on an LLC (last level cache) shared machine. MCC-DB tries to analyze the cache consumption of each query task and queue the tasks with knowledge of cache usage prediction. The benchmark numbers say this scheme can reduce query execution times by up to 33% [5].
The second half of the day introduced the structure of the DRAM row buffer, and explained why, when an L2 cache conflict happens, the row buffer will also conflict (without the improved algorithm). Then Prof Zhang introduced an algorithm to keep the locality in DRAM and avoid row buffer conflicts. The algorithm is surprisingly simple: just XOR 2 parts of the physical address. The story of how they found the row buffer conflict, and how they analyzed the root cause, was quite interesting. Details can be found in [2].

Day 3
The first half of the day started with disk cache replacement algorithm. Introduced LRU, and analyzed the disadvantage of LRU. A major issue was, too many accessed-once-only MRU data will flush out accessed-many-times LRU data. Then Prof Zhang introduced an algorithm called LIRS( Low Inter-reference Recency Set). The basic idea of LIRS is to make multi-level queues LRU. When a page is cached, just keep it in low level queue, only when it is accessed more than once, move it into high level queue. The cache pages in low level queue are the first choice to be replaced out. The whole algorithm is (IMHO) quite complex, and I can not tell all the details without reading the paper [6] again. Though LIRS is cool, it can not replace current improved LRU implementation in Linux kernel for page cache replacement. The key issue is, in LIRS, a quite complex queue insert/remove/move operation needs a big lock to protect involved stacks, which will be a big performance penalty in kernel space. Therefore, currently LIRS is used in user space, for example postgres (if I remember correctly).
The second half of the day continued with page cache replacement algorithms. Prof Zhang introduced a simplified algorithm called Clock-Pro [7], which is an improvement based on the CLOCK page replacement algorithm. Right now there is a patch that implements Clock-Pro, but it is not upstream yet. After Clock-Pro, Prof Zhang introduced another algorithm to avoid cache thrashing, called Token-ordered LRU [3]. The basic idea of Token-ordered LRU is that when the page cache is full, newly read-in data has to replace existing pages. In order to avoid thrashing, a token is assigned to a selected process; when replacing cache pages, the pages of the process holding the token are not replaced. This scheme makes sure the token holder can execute faster and finish earlier, releasing its pages sooner. Token-ordered LRU is in the upstream Linux kernel; the implementation is around 50 lines of C code, impressively small.
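As a rough illustration of the token idea (not the actual kernel implementation), the sketch below simply skips the token holder's pages when picking a replacement victim. The data structures and names are my own simplifications.

#include <stdio.h>

/* Toy model: one owner process per page, which is a simplification. */
struct page {
	long pfn;
	int owner_pid;
};

static int token_pid = -1;      /* pid currently holding the token */

static void grab_token(int pid)
{
	if (token_pid == -1)
		token_pid = pid;
}

/* Pick a victim from an LRU-ordered array, skipping the token holder. */
static struct page *pick_victim(struct page *lru, int n)
{
	for (int i = 0; i < n; i++)
		if (lru[i].owner_pid != token_pid)
			return &lru[i];
	return n ? &lru[0] : NULL;   /* nothing else left: ignore the token */
}

int main(void)
{
	struct page lru[] = {
		{ 100, 42 }, { 101, 42 }, { 102, 7 }, { 103, 42 },
	};

	grab_token(42);              /* process 42 gets to keep its pages */
	struct page *victim = pick_victim(lru, 4);
	printf("evict pfn %ld (owner %d)\n", victim->pfn, victim->owner_pid);
	return 0;
}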

Day 4
The first half of the day again started with disk cache replacement. For a hard disk, due to seeking and read-ahead buffering, reading contiguous data is much faster than reading random data. Therefore, when a page fault happens, re-reading contiguous data from disk costs much less than re-reading random data, so caching random data in memory pays off more than caching contiguous data. However, two LBA-contiguous blocks are not necessarily adjacent in the physical layout on the hard disk. Prof Zhang introduced how they judge whether LBA-contiguous blocks are physically adjacent on disk. The basic idea is to track the last 2 access times of each cached block; if two LBA-contiguous blocks are read in within a small interval, these 2 blocks are regarded as physically contiguous on the hard disk. Then, when the buffer cache gets full, the physically contiguous blocks are replaced first [8]. IMHO, the drawback of this algorithm is that a rather big tree structure is needed inside kernel space to track the access times of each block. Also, the timestamps are taken by reading the TSC, and I am not sure whether reading the TSC on every block read has a performance penalty.
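A toy sketch of this heuristic, as I understood it, is shown below. The threshold, the structure layout and the little demo are my own assumptions rather than the actual design in [8].

#include <stdio.h>

#define ADJ_WINDOW_US   500     /* the "small interval" threshold, made up */

struct cached_block {
	long lba;
	unsigned long long last_access_us;  /* e.g. derived from the TSC */
	int sequential;                     /* physically-contiguous guess */
};

/*
 * When block "blk" is read in, look for its LBA predecessor among the
 * cached blocks; if the predecessor was read very recently, mark both
 * as sequential: they are very likely adjacent on the platter too.
 */
static void classify(struct cached_block *cache, int n,
		     struct cached_block *blk)
{
	for (int i = 0; i < n; i++) {
		if (cache[i].lba == blk->lba - 1 &&
		    blk->last_access_us - cache[i].last_access_us <= ADJ_WINDOW_US) {
			cache[i].sequential = 1;
			blk->sequential = 1;
		}
	}
}

/*
 * On memory pressure, prefer evicting sequential blocks: re-reading
 * them from disk is cheap compared to random blocks.
 */
static struct cached_block *pick_victim(struct cached_block *cache, int n)
{
	for (int i = 0; i < n; i++)
		if (cache[i].sequential)
			return &cache[i];
	return n ? &cache[0] : NULL;        /* fall back to any block */
}

int main(void)
{
	struct cached_block cache[3] = {
		{ 1000, 10, 0 },    /* read at t=10us */
		{ 1001, 200, 0 },   /* read 190us later: sequential pair */
		{ 5000, 90000, 0 }, /* random block */
	};

	classify(cache, 1, &cache[1]);      /* classify block 1001 against 1000 */
	classify(cache, 2, &cache[2]);      /* block 5000 stays random */
	struct cached_block *victim = pick_victim(cache, 3);
	printf("evict lba %ld (sequential=%d)\n", victim->lba, victim->sequential);
	return 0;
}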
The second half of the day was all about SSDs. Prof Zhang spent much time introducing the basic structure of an SSD, and their work in cooperation with Intel. They ran quite a lot of benchmarks on an Intel SSD (X25-E, IIRC) and observed interesting performance numbers [9]. One of the numbers that impressed me was that sequential reads are faster than random reads. I asked Prof Zhang: since a dirty block has to be erased before the next write, wouldn't that be a natural fit for implementing a copy-on-write file system (though the erase block size is somewhat bigger than the file system block size)? There was no explicit answer to this question.

Day 5
Today we discussed data transfer and cache management on the internet, especially P2P data sharing. Prof Zhang introduced their research explaining why P2P is the most efficient method to transfer multimedia data on the internet. The basic idea is that multimedia traffic on the internet does not follow a Zipf-like distribution [10]; a P2P network optimized for the topology of users' locations can cache the most accessed multimedia data in the users' local networks, which maximizes data transfer bandwidth and speed for multimedia data on the internet [11].
This was the last topic of the whole course. After this talk, there was a summary speech by the Dragon Star program organizers. Then the 5-day course concluded.

From the above description, though this course touched many fields from CPU caches to internet P2P protocols, there is only one thread: improving data access performance by caching. IMHO, there are 3 methods to improve system I/O performance: caching, duplication, and prefetching. This course provides a quite systematic introduction to the first solution, caching. To design a cache system (no matter where it is located in the storage stack), some key issues should be considered: cache lookup, cache conflict, and cache replacement. This course shares the experience of a group of people (Prof Zhang's team) on how to improve overall system performance with respect to these issues. As a file system developer, the content of this course hits my professional area perfectly; it brought me many new concepts and gave me a chance to learn how experienced people think and solve problems.

Prof Zhang, please accept my sincere regards for your great work during these days. You were so kind to share your knowledge with us, and worked so hard even in the 40C+ heat. I wish we can meet somewhere, sometime, again ...
Finally, I want to thank the Dragon Star program office for organizing such a successful course, which gave me the chance to know different people and their great work. I should also thank my employer, a great company, for sending me to Changsha and giving me the opportunity to have such an exciting experience.

All slide files of this course can be found at http://blog.coly.li/docs/dragonstar/dragonstar.tar.bz2

Reference:
[1] C. Zhang, X. Zhang, and Y. Yan, “Two fast and high-associativity cache schemes”, IEEE Micro, Vol. 17, No. 5, September/October, 1997, pp. 40-49.
[2] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality”, Proceedings of the 33rd Annual International Symposium on Microarchitecture, (Micro-33), Monterey, California, December 10-13, 2000. pp. 32-41.
[3] Song Jiang and Xiaodong Zhang, “Token-ordered LRU: an effective page replacement policy and its implementation in Linux systems”, Performance Evaluation, Vol. 60, Issue 1-4, 2005, pp. 5-29.
[4] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan, “Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems”, Proceedings of the 14th International Symposium on High Performance Computer Architecture (HPCA’08), Salt Lake City, Utah, February 16-20, 2008.
[5] Rubao Lee, Xiaoning Ding, Feng Chen, Qingda Lu, and Xiaodong Zhang, “MCC-DB: minimizing cache conflicts in multi-core processors for databases”, Proceedings of the 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France, August 24-28, 2009.
[6] Song Jiang and Xiaodong Zhang, “LIRS: an efficient low inter-reference recency set replacement to improve buffer cache performance”, Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’02), Marina Del Rey, California, June 15-19, 2002.
[7] Song Jiang, Feng Chen, and Xiaodong Zhang, “CLOCK-Pro: an effective improvement of the CLOCK replacement”, Proceedings of 2005 USENIX Annual Technical Conference (USENIX’05), Anaheim, CA, April 10-15, 2005.
[8] Xiaoning Ding, Song Jiang, Feng Chen, Kei Davis, and Xiaodong Zhang, “DiskSeen: exploiting disk layout and access history to enhance I/O prefetch”, Proceedings of the 2007 USENIX Annual Technical Conference, (USENIX’07), Santa Clara, California, June 17-22, 2007.
[9] Feng Chen, David Koufaty, and Xiaodong Zhang, “Understanding intrinsic characteristics and system implications of flash memory based solid state drives”, Proceedings of the 2009 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance 2009), Seattle, WA, June 15-19, 2009.
[10] Lei Guo, Enhua Tan, Songqing Chen, Zhen Xiao, and Xiaodong Zhang, “Does Internet media traffic really follow Zipf-like distribution?”, Proceedings of ACM SIGMETRICS’07 Conference, (Extended Abstract), San Diego, California, June 12-16, 2007.
[11] Lei Guo, Songqing Chen, and Xiaodong Zhang, “Design and evaluation of a scalable and reliable P2P assisted proxy for on-demand streaming media delivery”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, 2006, pp. 669-682.
