<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>An enjoyed kernel apprentice</title>
	<atom:link href="http://blog.coly.li/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.coly.li</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sun, 10 Mar 2013 15:36:35 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>openSUSE Conference 2012 in Prague</title>
		<link>http://blog.coly.li/?p=195</link>
		<comments>http://blog.coly.li/?p=195#comments</comments>
		<pubDate>Tue, 30 Oct 2012 10:59:03 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[Great Days]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=195</guid>
		<description><![CDATA[From Oct 20 to 23, I was invited and sponsored by openSUSE to give a talk at the openSUSE Conference (OSC2012). The venue was the Czech Technical University in Prague, Czech Republic, a beautiful university (without walls) in a beautiful city. It had been 5 years since I last visited Prague (for the SuSE Labs conference), as [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://conference.opensuse.org/"><img class="alignnone  wp-image-198" style="border: 1px solid black;" title="openSUSE conference banner" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/Gplus_conference_cover.jpeg" width="592" height="114" /></a></p>
<p>From Oct 20 to 23, I was invited and sponsored by openSUSE to give a talk at the openSUSE Conference (OSC2012). The venue was the Czech Technical University in Prague, Czech Republic, a beautiful university (without walls) in a beautiful city.</p>
<p>It had been 5 years since I last visited Prague (for the SuSE Labs conference), and 3 years since I last attended the openSUSE conference as a speaker, which was OSC2009 in Nuremberg. At OSC 2009, the topic of my talk was &#8220;porting openSUSE to MIPS platform&#8221;; this was a Google Summer of Code project accomplished by Eryu Guan (who became a Redhat employee after graduating). At that time, almost all active kernel developers from China were hired by multi-national companies, and few local companies in China (not including universities and institutes) contributed patches to the Linux kernel. In 2009, after Wensong Zhang (original author of <a title="LVS project home page" href="http://www.linuxvirtualserver.org/" target="_blank">Linux Virtual Server</a>) joined <a href="http://www.taobao.com" target="_blank">Taobao</a>, this local e-business company became willing to optimize the Linux kernel for its online servers and to contribute patches back to the Linux kernel community. IMHO, this was a small but important change in China, and it would be my honor to be involved in it. Therefore in June 2010, I left SuSE Labs and joined Taobao, to help the company build a <a title="Taobao kernel team home page" href="http://kernel.taobao.org" target="_blank">kernel engineering team</a>.</p>
<p>From the first day <a title="Taobao kernel team members" href="http://kernel.taobao.org/index.php/Documents/kernel_team_members" target="_blank">the team</a> was built, the team and I have applied many ideas which I learned from SuSE/openSUSE kernel engineering, e.g. how to cooperate with the kernel community, how to organize kernel patches, and how to integrate kernel patches and the kernel tree with the build system. After 2+ years, with great support from Wensong and other senior managers, the Taobao kernel team has grown to 10 people and has contributed <a title="a patch statistic page maintained by Chen Wang" href="http://www.remword.com/kps_result/all_whole.html" target="_blank">160+ patches</a> to the upstream Linux kernel, becoming one of the most active Linux kernel development teams in China. Colleagues from other departments and product lines recognize the value of Linux kernel maintenance and performance optimization, while we open <a title="Taobao kernel team accomplished projects" href="http://kernel.taobao.org/index.php/Documents/kernel_team_accomplished_projects" target="_blank">all project information</a> and <a title="Taobao kernel git tree" href="http://kernel.taobao.org/git/?p=taobao-kernel.git;a=summary" target="_blank">kernel patches</a> to people outside the company. With the knowledge learned from openSUSE engineering, we have laid a solid foundation for the Taobao kernel development/maintenance procedure.</p>
<p>This time the topic of my talk was &#8220;<a title="slide file" href="http://blog.coly.li/docs/osc12-coly-taobao.pdf" target="_blank">Linux kernel development/maintenance in Taobao &#8212; what we learn from openSUSE engineering</a>&#8221;; this was an effort to say &#8220;Thank you&#8221; to the <a title="link to opensuse.org" href="http://opensuse.org" target="_blank">openSUSE community</a>. Thanks to the openSUSE conference organization team, I had the opportunity to introduce what we learned from openSUSE and contributed to the community in the past 2+ years. The <a href="http://blog.coly.li/docs/osc12-coly-taobao.pdf">slide file can be downloaded here</a>, if anyone is interested in this talk.</p>
<p>Coming back to the openSUSE conference 2 years later was a happy and sweet experience, especially meeting many old friends with whom I had worked for years. I met people from the YaST team, server team and SuSE Labs, as well as some who no longer work for SUSE but are still active in the openSUSE community. Thanks again to the conference organization team for giving us the rare and unique chance to communicate face-to-face, especially community members like me who are not located in Europe and have to travel overseas.</p>
<p>The conference venue for the first 2 days was inside the building of FIT ČVUT (Faculty of Information Technology of Czech Technical University in Prague). There were many meeting rooms available inside the building, so that dozens of talks, seminars and BOFs were able to happen concurrently. I have to say, in order to accommodate 600+ registered attendees, choosing such a large venue was really a great idea. On Monday the venue moved to another building; though there were fewer meeting rooms, the main room (where my talk took place) was bigger.</p>
<p>&nbsp;</p>
<p><a href="http://blog.coly.li/wp-content/uploads/2012/10/cpupower-1.jpg"><img class="alignnone size-thumbnail wp-image-201" title="CPUpower talk by Thomas Renninger" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/cpupower-1-150x150.jpg" width="150" height="150" /></a>  <a href="http://blog.coly.li/wp-content/uploads/2012/10/cpupower-2.jpg"><img class="alignnone size-thumbnail wp-image-202" title="CPUpower talk by Thomas Renninger" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/cpupower-2-150x150.jpg" width="150" height="150" /></a></p>
<p>CPU power talk by <em>Thomas Renninger</em></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2012/10/cgroup.jpg"><img class="alignnone size-thumbnail wp-image-203" title="cgroup usage by Petr Baudiš" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/cgroup-150x150.jpg" width="150" height="150" /></a></p>
<p>Cgroup usage by <em>Petr Baudiš</em></p>
<p>Besides talking with many speakers outside the meeting rooms and chairing a BOF on Linux Cgroup (control group, especially focused on memory and I/O control), some non-Linux-kernel talks attracted me quite a lot. Though all the slides and video recordings can be found on the internet (thanks to the organization team again ^_^), I would like to share the talk by Thijs de Vries, which impressed me among many excellent talks.</p>
<p><a href="http://blog.coly.li/wp-content/uploads/2012/10/game-2.jpg"><img class="alignnone size-thumbnail wp-image-204" title="Thijs de Vries: Gamification - using game elements and tactics in a non-game context" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/game-2-150x150.jpg" width="150" height="150" /></a> <a href="http://blog.coly.li/wp-content/uploads/2012/10/game-3.jpg"><img class="alignnone size-thumbnail wp-image-205" title="Thijs de Vries: Gamification - using game elements and tactics in a non-game context" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/game-3-150x150.jpg" width="150" height="150" /></a> <a href="http://blog.coly.li/wp-content/uploads/2012/10/game-4.jpg"><img class="alignnone size-thumbnail wp-image-206" title="Thijs de Vries: Gamification - using game elements and tactics in a non-game context" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/game-4-150x150.jpg" width="150" height="150" /></a></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2012/10/game-5.jpg"><img class="alignnone size-thumbnail wp-image-207" title="Thijs de Vries: Gamification - using game elements and tactics in a non-game context" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/game-5-150x150.jpg" width="150" height="150" /></a> <a href="http://blog.coly.li/wp-content/uploads/2012/10/game-6.jpg"><img class="alignnone size-thumbnail wp-image-208" title="Thijs de Vries: Gamification - using game elements and tactics in a non-game context" alt="" src="http://blog.coly.li/wp-content/uploads/2012/10/game-6-150x150.jpg" width="150" height="150" /></a></p>
<p><em>Thijs de Vries</em>: Gamification &#8211; using game elements and tactics in a non-game context</p>
<p>Thijs de Vries was from a game design company (correct me if I am wrong); in this talk he explained many design principles and practices used in the company. He mentioned that when they planned to design a game, there were 3 objects to consider, which in turn were project, procedure and product. A project was built for the plan, a procedure was set during the project execution, and a product was shipped as the output of the project. I do like this idea for design; it&#8217;s something new and helpful to me. Then he introduced how to make people have fun, get involved in the game, and learn from the game. In Thijs&#8217; talk, it seemed that designing fun rules and goals is not difficult, but IMHO an educational game with fun rules and social goals is not easy to design even with hard and careful effort. From his talk, I strongly felt the innovation and genius of design (indeed not only game design) from a different angle which I had never met or imagined before.</p>
<p>Besides the regular conference talks, a lot of conversation also happened outside the meeting rooms. Alexander Graf mentioned the effort to enable SUSE Linux on ARM boxes, which was a very interesting topic for people like me who are looking for low power hardware. For some workloads at Taobao, a more powerful x86 CPU no longer helps performance, so replacing it with a low power ARM CPU may save a lot of money on power and cooling. Currently the project seems to be going well, and I hope the product may be shipped in the near future. Jiaju Zhang also introduced his proposal for a distributed clustering protocol called Booth. We talked about the idea of Booth last year, and it was good to see this idea become a real project step by step. As a file system developer, I also had some discussions about btrfs and OCFS2 with SuSE Labs people. For btrfs it was unanimous that this file system was not ready for large scale deployment yet; people from Fujitsu, Oracle, SuSE, Redhat, and other organizations were working hard to improve its quality for production usage. For OCFS2, we talked about file system freeze across the cluster; there had been little initial effort in the last 2 years, and a very early idea was discussed on how to freeze write I/O on each node in the cluster. It seems OCFS2 is currently in maintenance status; I hope someday I (or someone else) will have the time and interest to work on this interesting and useful feature.<br />
This article is just part of my experience from the openSUSE conference. OSC2012 was well organized, including but not limited to schedule, venue, video recording, meals, travel, hotel, etc. Here I should once again thank several people who helped me attend this great conference,</p>
<ul>
<li>People behind cfp@opensuse.org, who accepted my proposal</li>
<li>People behind travel-support@opensuse.org, who kindly offered the sponsorship for my travel</li>
<li>Stella Rouzi, who helped me with my visa application</li>
<li>Andreas Jaeger, Lars Muller, and other people who encouraged me to give a talk at OSC2012</li>
<li>Alexander Graf and others who reviewed my slides</li>
</ul>
<p>Finally, if you are interested in finding more information about openSUSE conference 2012, these URLs may be informative,</p>
<blockquote><p>Conference schedule: <a title="OSC2012 schedule" href="http://bootstrapping-awesome.org/schedule/" target="_blank">http://bootstrapping-awesome.org/schedule/</a><br />
Conference video: <a href="http://en.opensuse.org/Archive:Conference_video_2012" rel="nofollow">http://en.opensuse.org/Archive:Conference_video_2012</a><br />
Slide of my talk: <a title="Slide file of my talk" href="http://blog.coly.li/docs/osc12-coly-taobao.pdf" target="_blank">http://blog.coly.li/docs/osc12-coly-taobao.pdf</a><br />
Video of my talk: <a title="Video stream of my talk" href="http://blip.tv/openSUSEtv/osc12-kernel-development-maintenance-in-taobao-6415082" target="_blank">http://blip.tv/openSUSEtv/osc12-kernel-development-maintenance-in-taobao-6415082</a></p></blockquote>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=195</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>alloc_sem of Ext4 block group</title>
		<link>http://blog.coly.li/?p=182</link>
		<comments>http://blog.coly.li/?p=182#comments</comments>
		<pubDate>Mon, 07 Feb 2011 17:53:10 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[File System Magic]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=182</guid>
		<description><![CDATA[Yesterday Amir Goldstein sent me an email about a deadlock issue. I was on Chinese New Year vacation and did not have time to check the code (I also knew I could not answer his question with ease). Thanks to Ted, who provided a quite clear answer. I feel Ted&#8217;s answer is also very informative to [...]]]></description>
				<content:encoded><![CDATA[<p>Yesterday Amir Goldstein sent me an email about a deadlock issue. I was on Chinese New Year vacation and did not have time to check the code (I also knew I could not answer his question with ease). Thanks to Ted, who provided a quite clear answer. I feel Ted&#8217;s answer is also very informative to me, so I copy&amp;paste the conversation from linux-ext4@vger.kernel.org to my blog. The copyrights of the text quoted below belong to the original authors.</p>
<blockquote><p>On Sun, Feb 06, 2011 at 10:43:58AM +0200, Amir Goldstein wrote:<br />
&gt; When looking at alloc_sem, I realized that it is only needed to avoid<br />
&gt; race with adjacent group buddy initialization.<br />
Actually, alloc_sem is used to protect all of the block group specific<br />
data structures; the buddy bitmap counters, adjusting the buddy bitmap<br />
itself, the largest free order in a block group, etc.  So even in the<br />
case where block_size == page_size, alloc_sem is still needed!<br />
- Ted</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=182</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Three Practical System Workloads of Taobao</title>
		<link>http://blog.coly.li/?p=177</link>
		<comments>http://blog.coly.li/?p=177#comments</comments>
		<pubDate>Tue, 23 Nov 2010 03:44:08 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[kernel]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=177</guid>
		<description><![CDATA[Days ago, I gave a talk at an academic seminar at ACT of Beihang University (http://act.buaa.edu.cn/). In my talk, I introduced three typical system workloads we (a group of system software developers inside Taobao) observed from the most heavily used/deployed product lines. The introduction was quite brief, with no details touched on here. We don&#8217;t mind to [...]]]></description>
				<content:encoded><![CDATA[<p>Days ago, I gave a talk at an academic seminar at ACT of Beihang University (http://act.buaa.edu.cn/). In my talk, I introduced three typical system workloads we (a group of system software developers inside Taobao) observed from the most heavily used/deployed product lines. The introduction was quite brief, with no details touched on here. We don&#8217;t mind sharing what we did imperfectly, and we would like to cooperate with the open source community and industry with an open mind to improve <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>If you find there is anything unclear or misleading, please let me know. Communication makes things better most of the time <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>[<a href="http://blog.coly.li/docs/Challenge_of_Taobao_Workload.pdf" target="_blank">The slide file can be found here</a>]</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=177</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>China Linux Storage and File System Workshop 2010</title>
		<link>http://blog.coly.li/?p=144</link>
		<comments>http://blog.coly.li/?p=144#comments</comments>
		<pubDate>Sat, 16 Oct 2010 18:41:26 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[File System Magic]]></category>
		<category><![CDATA[Great Days]]></category>
		<category><![CDATA[kernel]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=144</guid>
		<description><![CDATA[[CLSF 2010, Oct 14~15, Intel Zizhu Campus, Shanghai, China] Similar to the Linux Storage and File System Summit in North America, the China Linux Storage and File System Workshop is a chance for most of the active upstream I/O related kernel developers to get together and share their ideas and current status. We (the CLSF committee) invited around 26 people [...]]]></description>
				<content:encoded><![CDATA[<p>[CLSF 2010, Oct 14~15, Intel Zizhu Campus, Shanghai, China]</p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_7.jpg"><img class="alignnone size-full wp-image-170" title="ChinaLSF 2010" src="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_7.jpg" alt="" width="600" height="450" /></a></p>
<p>Similar to the Linux Storage and File System Summit in North America, the China Linux Storage and File System Workshop is a chance for most of the active upstream I/O related kernel developers to get together and share their ideas and current status.</p>
<p>We (the CLSF committee) invited around 26 people to China LSF 2010, including community developers who contribute to the Linux I/O subsystem, and engineers who develop their storage products/solutions based on Linux. In order to reduce travel cost for all attendees, we decided to co-locate China LSF with CLK (China Linux Kernel Developers Conference) in Shanghai.</p>
<p>This year, Intel OTC (Opensource Technology Center) contributed a lot to the conference organization. It kindly provided a free and comfortable conference room and assigned employees to help with the organization and preparation; two intern students acted as volunteers, helping with many small tasks.</p>
<p>CLSF2010 is a two-day conference; here are some interesting topics (IMHO) which I&#8217;d like to share on my blog. I don&#8217;t understand every topic very well, so if there is any error/mistake in this text, please let me know. Any errata are welcome <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<blockquote><p>&#8211; Writeback, led by Fengguang Wu</p>
<p>&#8211; CFQ, Block IO Controller &amp; Write IO Controller, led by Jianfeng Gui, Fengguang Wu</p>
<p>&#8211; Btrfs, led by Coly Li</p>
<p>&#8211; SSD &amp; Block Layer, led by Shaohua Li</p>
<p>&#8211; VFS Scalability, led by Tao Ma</p>
<p>&#8211; Kernel Tracing, led by Zefan Li</p>
<p>&#8211; Kernel Testing and Benchmarking, led by Alex Shi</p></blockquote>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_4.jpg"><img class="alignnone size-full wp-image-171" title="Discussion" src="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_4.jpg" alt="" width="600" height="450" /></a></p>
<p>Besides the above topics, we also had &#8216;From Industry&#8217; sessions, where engineers from Baidu, Taobao and EMC shared their experience of building their own storage solutions/products based on Linux.</p>
<p>In this blog, I&#8217;d like to share the information I got from CLSF 2010, hope it could be informative <img src='http://blog.coly.li/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<h3>Write back</h3>
<p>The first session was about write back, which is quite a hot topic recently. Fengguang has done quite a lot of work on it, and kindly volunteered to lead this session.</p>
<p>An idea was brought up to limit the dirty page ratio per process. Fengguang made a patch and shared a demo picture with us. When the dirty pages of a process exceed the upper limit specified for it, the kernel writes back the dirty pages of this process smoothly, until the number of dirty pages is reduced to a pre-configured rate. This idea is helpful for processes holding a large number of dirty pages. Some people were concerned that this patch didn&#8217;t help when there are a lot of processes, each holding a few dirty pages. Fengguang replied that for a server application, if this condition happened, the design might be buggy.</p>
<p>People also mentioned that now the erase block size of SSDs has increased from KBs to MBs, writing out a larger number of pages at a time may help overall file system performance. Engineers from Baidu shared their experience,</p>
<blockquote><p>&#8211; By increasing the write out size from 4MB to 40MB, they achieved a 20% performance improvement.</p>
<p>&#8211; By using an extent based file system, they got a more contiguous on-disk layout and lower memory consumption for metadata.</p></blockquote>
<p>Fengguang also shared his idea on how to control the rate at which a process dirties pages. The original idea was to control dirty pages by I/O (calling writeback_inode(dirtied * 3/2)); after several rounds of improvement it became wait_for_writeback(dirtied/throttle_bandwidth). By this means, the I/O bandwidth of dirty pages for a process is also controlled.</p>
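<p>The arithmetic behind wait_for_writeback(dirtied/throttle_bandwidth) can be sketched in a few lines. This is only an illustration of the ratio, not the kernel code: the function name, the units (pages and pages per second) and the numbers are assumptions made for this example.</p>

```python
def throttle_pause_seconds(pages_dirtied, throttle_bandwidth_pages_per_sec):
    """Return how long a writer would sleep after dirtying some pages.

    Hypothetical helper: the pause is simply the amount of newly dirtied
    data divided by the bandwidth the writer is allowed to consume.
    """
    if not throttle_bandwidth_pages_per_sec > 0:
        raise ValueError("throttle bandwidth must be positive")
    return pages_dirtied / throttle_bandwidth_pages_per_sec


# Assumed numbers: 256 pages dirtied against a budget of 1024 pages/s.
pause = throttle_pause_seconds(256, 1024)
print(pause)  # prints 0.25
```

<p>With these assumed numbers the writer sleeps a quarter of a second, so on average it cannot dirty pages faster than the configured bandwidth.</p>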
<p>During the discussion, Fengguang pointed out that the event of a page becoming dirty was more important than whether a page was dirty. Engineers from Baidu said that, in order to avoid a kernel/user space memory copy during file read/write while still using the kernel page cache, they used mmap to read/write file pages rather than calling read/write syscalls. In this case, a writable mmap page is initially mapped read-only; when the first write happens a page fault is triggered, and the kernel then knows this page got dirty.</p>
<p>It seems many ideas are being worked on to improve writeback performance, including active writeback in the background and some cooperation with the underlying block layer. My current focus is not here, but I believe people in the room could help out a bit <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
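<p>As a rough user space illustration of the mmap approach, the sketch below modifies a file through a shared mapping instead of calling write(). It is not Baidu&#8217;s code; the helper names, the temporary file and the offsets are assumptions for this example, and the page fault based dirty tracking happens inside the kernel where this script cannot show it.</p>

```python
import mmap
import os
import tempfile


def write_via_mmap(path, offset, payload):
    """Modify a file through a shared mapping instead of calling write()."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
            # The store goes to the mapped pages of the file; no separate
            # user space buffer is handed to a write() syscall.
            m[offset:offset + len(payload)] = payload
            m.flush()  # ask the kernel to write the modified pages back


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * mmap.PAGESIZE)  # a mapping needs a non-empty file
        os.close(fd)
        write_via_mmap(path, 16, b"hello")
        with open(path, "rb") as f:
            return f.read()[16:21]
    finally:
        os.remove(path)


print(demo())
```

<p>Reading the file back through the normal file API returns the bytes stored through the mapping, which shows that both paths share the same file pages.</p>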
<p><a href="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_6.jpg"><img class="alignnone size-full wp-image-174" title="Discussion" src="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_6.jpg" alt="" width="600" height="450" /></a></p>
<h3>Btrfs</h3>
<p>Recently, many developers in China have started to work on btrfs, e.g. Xie Miao, Zefan Li, Shaohua Li, Zheng Yan, &#8230; Therefore we specially arranged a two-hour session for btrfs. The main purpose of the btrfs session is to share what we are doing on btrfs.</p>
<p>Most people agreed that btrfs needed a real fsck tool now. Engineers from Fujitsu said they planned to invest people in btrfs checking tool development. Miao Xie, Zefan Li, Coly Li and other developers suggested considering the pain points of fsck from the beginning,</p>
<blockquote><p>&#8211; memory consumption</p>
<p style="padding-left: 30px;">Now 10TB+ storage media are cheap and common; for large file systems built on them, fsck needs more memory to hold metadata (e.g. bitmaps, dir blocks, inode blocks, btree internal blocks &#8230;). For online fsck, consuming too much memory in file system checking will have a negative performance impact on the page cache or other applications. For offline fsck this was not a problem, but now that online fsck is coming, we have to face this open question <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>&#8211; fsck speed</p>
<p style="padding-left: 30px;">A tree structured file system has (much) more metadata than a table structured file system (like Ext2/3/4), which may mean more I/O and more time. For a 10TB+ file system that is 80% full, how to reduce the file system checking time will be a key issue, especially for online service workloads. I proposed a solution: allocating metadata on SSD or another device with faster seeks, so that checking metadata has no (or little) seek time, which results in faster file system checking.</p>
<p style="padding-left: 30px;">Weeks before, two intern students, Kunshan Wang and Shaoyan Wang, worked with me and wrote a very basic patch set (including kernel and user space code) to allocate metadata from a device with faster seeks. This patch set compiles, the students did a quite basic verification of metadata allocation, and the patch worked. I have not reviewed the patch yet; from a quite rough code check, much improvement is needed. I posted this draft patch set to the China LSF mailing list to call for more comments from CLSF attendees. I hope that next month I can have time to improve the great job by Kunshan and Shaoyan.</p>
</blockquote>
<p>Zefan Li said there was a todo list for btrfs; a long term task was data de-duplication, and a short term task was allocating data from SSD. Herbert Xu pointed out that the underlying storage media impacts file system performance quite a lot: in a benchmark by Ric Wheeler of Redhat on a Fusion IO high end PCI-E SSD, there was almost no performance difference between well known file systems like xfs, ext2/3/4 or btrfs.</p>
<p>People also said that these days, the code review or merge of btrfs patches was often delayed; it seemed the btrfs maintainer was too busy to handle the community patches. There was a reply from the maintainer that the situation would be improved and patches would be handled in time, but there was no obvious improvement so far. I can understand that when a person has more urgent tasks like kernel tree maintenance, he or she does have difficulty handling non-trivial patches in time if this is not his or her highest priority job. From CLSF, I find more and more Chinese developers starting to work on btrfs; I hope they will be patient if their patches don&#8217;t get handled in time <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Engineers from Intel OTC mentioned there was no btrfs support in popular boot loaders like Grub2. IIRC someone is working on it, and the patches are almost ready. Shaohua asked why not load the Linux kernel with a Linux kernel, as the kboot project does. People pointed out there still had to be something to load the first Linux kernel; this was a chicken-and-egg question <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  My point was that it should not be very hard to enable btrfs support in the boot loader; a small Google Summer of Code project could do it. I&#8217;d like to port and merge the patches (if they are available) to openSUSE, since I maintain the openSUSE grub2 package.</p>
<p>Shaohua Li shared his experience of btrfs development for the Meego project, where he did some work on fast boot and read ahead on btrfs. Shaohua said some performance advantage was observed on btrfs, and the better result was achieved by some hacks, like a big read ahead size, a dedicated work queue to handle write requests, and a big write back size. Fengguang Wu and Tao Ma pointed out this might be a general hack, because Ext4 and OCFS2 also used similar hacks for better performance.</p>
<p>Finally Shaohua Li pointed out there was a huge opportunity to improve the scalability of btrfs, since there were still many global locks and cache misses in the current code.</p>
<h3>SSD &amp; Block Layer</h3>
<p>This was a quite interesting session led by Shaohua Li. Shaohua started the session with some observed problems between SSD and the block layer,</p>
<blockquote><p>&#8211; Throughput is high, like network</p>
<p>&#8211; Disk controller gap, no MSI-x&#8230;</p>
<p>&#8211; Big locks, queue lock, scsi host lock, &#8230;</p></blockquote>
<p>Shaohua shared some benchmark results showing that at high IOPS the interrupts overloaded a single CPU; even on a multi-processor system, the interrupts could not be balanced across processors, which was a bottleneck for handling the interrupts generated by SSD I/O. If a system had 4 SSDs, one processor ran at 100% to handle the interrupts and the throughput was only around 60%-80%.</p>
<p>A workaround here was polling. Replacing interrupts with blk_iopoll could reduce the processor overload from interrupt handling and so help the performance numbers. However, Herbert Xu pointed out the key issue was that current hardware didn&#8217;t support multi-queue to handle the same interrupt. Different interrupts could be balanced across every processor in the system, but unlike network hardware, the same interrupt could not be spread over multiple queues and could only be handled by a single processor. Hardware multi-queue support should be the silver bullet.</p>
<p>For SSDs like those Fusion IO produces, a single device can deliver one million+ IOPS, so the parallel load is much higher than on a traditional hard disk. Herbert, Zefan and I agreed that some hidden race defects would be observed very soon.</p>
<p>Right now, the block layer is not ready for such a highly parallel I/O load. Herbert Xu pointed out that lock contention might be a big issue to solve. The source of the lock contention was the cache coherency cost of global resources protected by locks. Converting the global resources to per-CPU local data might be a direction to solve the lock contention issue. Since Jens and Nick have more convenient access to Fusion IO devices, we believe they can work with other developers to help out a lot.</p>
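<p>The per-CPU idea can be illustrated with a counter that is split into one slot per worker and only summed when somebody reads it. This is a user space sketch of the data layout under stated assumptions, not block layer code: the class and function names are made up for this example, threads stand in for CPUs, and Python&#8217;s interpreter lock means it shows the structure rather than any real speedup.</p>

```python
import threading


class ShardedCounter:
    """Counter split into per-worker slots, summed only when read."""

    def __init__(self, nr_shards):
        self.slots = [0] * nr_shards

    def add(self, shard, n=1):
        # Each worker only updates its own slot, so no shared lock is taken.
        self.slots[shard] += n

    def read(self):
        # Readers pay the cost of walking all slots instead of writers
        # paying for a shared lock on every update.
        return sum(self.slots)


def run(nr_workers=4, increments=10000):
    counter = ShardedCounter(nr_workers)

    def worker(shard):
        for _ in range(increments):
            counter.add(shard)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nr_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.read()


print(run())
```

<p>The trade-off is the usual one for per-CPU data: updates become cheap and local, while reading the total becomes more expensive.</p>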
<h3>Kernel Tracing</h3>
<p>Zefan Li helped to lead an interesting session about kernel tracing. I don&#8217;t have any real understanding of any kernel trace infrastructure; for me the only tool is printk(). IMHO printk is the best trace/debug tool for kernel programming. Anyway, debugging is always an attractive topic to curious programmers, and I felt Zefan did his job quite well <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>The OCFS2 developer Tao Ma mentioned that OCFS2 currently uses printk wrapper trace code, which is inflexible and quite obsolete; OCFS2 developers were thinking of using a trace infrastructure like ftrace.</p>
<p>Zefan pointed out that replacing previous printk based trace messages with ftrace should be done carefully, as there might be ABI (application binary interface) issues for user space tools. Some user space tools work with kernel messages (one can check kernel messages with the dmesg command). An Intel engineer mentioned a recent accident in which a kernel message modification caused the powertop tool to stop working correctly.</p>
<p>For file system trace, the situation might be easier. Because most of the trace info is used by file system developers or testers, whoever adds trace info to file system code might happily ignore the ABI issue. Anyway, it was just &#8220;might&#8221;, not &#8220;be able to&#8221;.</p>
<p>Zefan said there was a patch introducing TRACE_EVENT_ABI; if some trace info could form a stable user space ABI, it could be declared with TRACE_EVENT_ABI.</p>
<p>This session also discussed how ftrace works. Now I know the trace info is stored in a ring buffer. If ftrace is enabled but the ring buffer is not, the user is still not able to receive trace info. People also said that a user space trace tool would be necessary.</p>
<p>Someone said the perf tool is getting more and more powerful, and trace functions would probably be integrated into perf. The Linux kernel only needs one trace tool, and some people in this workshop think it might be perf (for me, I have no opinion, because I use neither).</p>
<p>Finally Herbert again suggested people pay attention to scalability issues when adding trace points. Currently the ring buffer was not a per-CPU local area, so adding trace points might introduce performance regressions for existing optimized code.</p>
<h3>From Industry</h3>
<p>In last year&#8217;s BeijingLSF, we invited two engineers from Lenovo. They shared their experience of using Linux as the base system for their storage solution. This session received quite positive feedback, and all committee members suggested continuing the From Industry sessions this year.</p>
<p>For ChinaLSF2010, we invited 3 companies to share their ideas with other attendees. Engineers from Baidu, Taobao and EMC led three interesting sessions, and people had a chance to learn which difficulties they encountered, how they solved the problems and what they achieved with their solutions or workarounds. Here I share some interesting points on my blog.</p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_2.jpg"><img class="alignnone size-full wp-image-172" title="From Industry I" src="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_2.jpg" alt="" width="600" height="450" /></a></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_5.jpg"><img class="alignnone size-full wp-image-173" title="From Industry II" src="http://blog.coly.li/wp-content/uploads/2010/10/ChinaLSF2010_5.jpg" alt="" width="600" height="450" /></a></p>
<h4>From Taobao</h4>
<p>Engineers from Taobao also shared their work based on Linux storage and file systems; the projects were Tair and TFS.</p>
<p>Tair is a distributed cache system used inside Taobao, and TFS is a distributed user space file system that stores pictures of Taobao goods. For detailed information, please check <a href="http://code.taobao.org">http://code.taobao.org</a> <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<h4>From EMC</h4>
<p>Engineers from EMC shared their work on file system recovery, especially file system checking. Tao Ma and I also mentioned what we did in fsck.ocfs2 (the ocfs2 file system checking tool). The opinion from EMC was that even if online file system checking is possible, offline fsck is still required, because an offline check can examine and fix a file system from a higher level scope.</p>
<p>Other points, which were also discussed in previous sessions, included memory footprint and running time &#8230;</p>
<h4>From Baidu</h4>
<p>This was the first time I met people from Baidu and had a chance to learn what they do on the Linux kernel. Thanks to the Baidu kernel team, we had the opportunity to learn what they have done in the past years.</p>
<p>Guangjun Xie from Baidu started the session by introducing Baidu&#8217;s I/O workload. Most of the I/O was related to indexing and distributed computing, and read performance mattered more than write performance. In order to reduce memory copying when reading data, they used mmap to read data pages from the underlying media into the page cache. Pages accessed via mmap could not take advantage of the Linux kernel page cache replacement algorithm, while Baidu didn&#8217;t want to implement a similar page cache in user space. Therefore they used a not-beautiful-but-efficient workaround: they implemented an in-house system call that updated the page (returned by mmap) in the kernel&#8217;s page LRU. By this means, the data pages could be managed by the kernel&#8217;s page cache code. Some people pointed out this was mmap() + read ahead. According to Baidu&#8217;s benchmark, their effort increased search workload performance by 100% on a single node server.</p>
<p>Baidu also tried a bigger block size on the Ext2 file system to make the data block layout more contiguous, and their performance data showed that the bigger block size also resulted in better I/O performance. IMHO, a local mode ocfs2 file system may achieve similar performance, because the basic allocation unit of ocfs2 is a cluster, and the cluster size can be from 4KB to 1MB.</p>
<p>Baidu also tried compressing/decompressing the data when writing to/reading from disk. Since most of Baidu&#8217;s data was text, the compression ratio was satisfyingly high. They even used a PCIe compression card, and the performance result was pretty good.</p>
<p>Guangjun also mentioned that when they used SATA disks, some I/O errors were silent errors. For metadata this was fatal, so at least a metadata checksum was necessary. For data checksums, they did it at the application level.</p>
<h3>Conclusion</h3>
<p>Now we come to the last part of this blog; let me give my own conclusion on ChinaLSF 2010 <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>IMHO, the organization and preparation this year were much better than for BeijingLSF 2009. People from Intel Shanghai OTC contributed a lot of time and effort before, during and after the workshop; without their effort, we could not have had such a successful event. A big thank you should also go to our sponsor EMC China, who not only sponsored the conference expenses but also sent engineers to share their development experience.</p>
<p>Let&#8217;s wait for next year for ChinaLSF 2011 <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=144</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Don&#8217;t waste your SSD blocks</title>
		<link>http://blog.coly.li/?p=124</link>
		<comments>http://blog.coly.li/?p=124#comments</comments>
		<pubDate>Mon, 26 Jul 2010 05:20:38 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[File System Magic]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=124</guid>
		<description><![CDATA[These days, one of my colleagues asked me a question, he formatted an ~80G Ext3 file system on SSD. After mounted the file system, the df output was, Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdb1 77418272 184216 73301344 1 /mnt As well as from fdisk output, it said, Device Boot Start End Blocks Id [...]]]></description>
				<content:encoded><![CDATA[<p>Recently, one of my colleagues asked me a question. He had formatted an ~80G Ext3 file system on an SSD. After mounting the file system, the df output was,</p>
<table border="0" cellspacing="0" frame="void" rules="none">
<colgroup span="1">
<col span="1" width="90"></col>
<col span="1" width="90"></col>
<col span="1" width="69"></col>
<col span="1" width="65"></col>
<col span="1" width="89"></col>
<col span="1" width="104"></col>
</colgroup>
<tbody>
<tr>
<td width="90" height="17" align="right"><span style="font-family: Droid Serif;">Filesystem</span></td>
<td width="90" align="right"><span style="font-family: Droid Serif;">1K-blocks</span></td>
<td width="69" align="right"><span style="font-family: Droid Serif;">Used</span></td>
<td width="65" align="right"><span style="font-family: Droid Serif;">Available</span></td>
<td width="89" align="right"><span style="font-family: Droid Serif;">Use%</span></td>
<td width="104" align="right"><span style="font-family: Droid Serif;">Mounted on</span></td>
</tr>
<tr>
<td height="17" align="right"><span style="font-family: Droid Serif;">/dev/sdb1</span></td>
<td align="right"><span style="font-family: Droid Serif;">77418272</span></td>
<td align="right"><span style="font-family: Droid Serif;">184216</span></td>
<td align="right"><span style="font-family: Droid Serif;">73301344</span></td>
<td align="right"><span style="font-family: Droid Serif;">1</span></td>
<td align="right"><span style="font-family: Droid Serif;">/mnt</span></td>
</tr>
</tbody>
</table>
<p>And the fdisk output said,</p>
<table border="0" cellspacing="0" frame="void" rules="none">
<colgroup span="1">
<col span="1" width="90"></col>
<col span="1" width="90"></col>
<col span="1" width="69"></col>
<col span="1" width="65"></col>
<col span="1" width="89"></col>
<col span="1" width="104"></col>
<col span="1" width="81"></col>
</colgroup>
<tbody>
<tr>
<td width="90" height="17" align="right"><span style="font-family: Droid Serif;">Device</span></td>
<td width="90" align="right"><span style="font-family: Droid Serif;">Boot</span></td>
<td width="69" align="right"><span style="font-family: Droid Serif;">Start</span></td>
<td width="65" align="right"><span style="font-family: Droid Serif;">End</span></td>
<td width="89" align="right"><span style="font-family: Droid Serif;">Blocks</span></td>
<td width="104" align="right"><span style="font-family: Droid Serif;">Id</span></td>
<td width="81" align="right"><span style="font-family: Droid Serif;">System</span></td>
</tr>
<tr>
<td height="17" align="right"><span style="font-family: Droid Serif;">/dev/sdb1</span></td>
<td align="right"><span style="font-family: Droid Serif;"><br />
</span></td>
<td align="right"><span style="font-family: Droid Serif;">7834</span></td>
<td align="right"><span style="font-family: Droid Serif;">17625</span></td>
<td align="right"><span style="font-family: Droid Serif;">78654240</span></td>
<td align="right"><span style="font-family: Droid Serif;">83</span></td>
<td align="right"><span style="font-family: Droid Serif;">Linux</span></td>
</tr>
</tbody>
</table>
<p>From his observation, before formatting the SSD there were 78654240 1K blocks available on the partition, and after formatting only 77418272 1K blocks could be used, which means roughly 1.2G of space on the partition went unused.</p>
<p>A more serious question was, from the output of df, used blocks + available blocks = 73485560, but the file system had 77418272 blocks, so 3932712 1K blocks disappeared! This 160G SSD cost him 430USD, and he complained that around 15USD was paid for nothing.</p>
<p>IMHO, this is quite an interesting question, and one asked by many people many times. This time, I&#8217;d like to spend some time explaining how the blocks are wasted, and how to make better use of every block on the SSD (since it&#8217;s quite expensive).</p>
<p>First of all, better storage usage depends on the I/O pattern in practice. This SSD is used to store large files for random I/O; most of the I/O (99%+) is reads at random file offsets, and writes can almost be ignored. Therefore, the goal is to use every available block to store a few very big files on the Ext3 file system.</p>
<p>If only the default command line is used to format an Ext3 file system, like &#8220;mkfs.ext3 /dev/sdb1&#8221;, mkfs.ext3 will do the following things for block allocation,</p>
<p>- Allocates reserved blocks for the root user, to prevent unprivileged users from using up all disk space.</p>
<p>- Allocates metadata like the superblock, backup superblocks, block group descriptors, and for each block group a block bitmap, an inode bitmap and an inode table.</p>
<p>- Allocates reserved group descriptor blocks for future file system extension.</p>
<p>- Allocates blocks for the journal.</p>
<p>Since the SSD is only for data storage, with no operating system installed on it, write performance is disregarded here, there is no requirement for future file system extension, and only a few files are stored on the file system, some of the block allocations are unnecessary and useless,</p>
<p>- Journal blocks</p>
<p>- Inode table blocks</p>
<p>- Reserved group descriptor blocks for file system resize</p>
<p>- Reserved blocks for root user</p>
<p>Let&#8217;s run dumpe2fs to see how many blocks are wasted on the above items. I only list part of the output here,</p>
<blockquote><p>&gt; dumpe2fs /dev/sdb1</p></blockquote>
<blockquote><p>Filesystem volume name:   &lt;none&gt;<br />
Last mounted on:          &lt;not available&gt;<br />
Filesystem UUID:          f335ba18-70cc-43f9-bdc8-ed0a8a1a5ad3<br />
Filesystem magic number:  0xEF53<br />
Filesystem revision #:    1 (dynamic)<br />
Filesystem features:      has_journal ext_attr <strong><span style="color: #ff0000;">resize_inode</span></strong> dir_index filetype needs_recovery sparse_super large_file<br />
Filesystem flags:         signed_directory_hash<br />
Default mount options:    (none)<br />
Filesystem state:         clean<br />
Errors behavior:          Continue<br />
Filesystem OS type:       Linux<br />
Inode count:              4923392<br />
Block count:              19663560<br />
<strong><span style="color: #ff0000;">Reserved block count:     983178</span></strong><br />
Free blocks:              19308514<br />
Free inodes:              4923381<br />
First block:              0<br />
Block size:               4096<br />
Fragment size:            4096<br />
<strong><span style="color: #ff0000;">Reserved GDT blocks:      1019</span></strong><br />
Blocks per group:         32768<br />
Fragments per group:      32768<br />
<span style="color: #ff0000;"><strong>Inodes per group:         8192<br />
Inode blocks per group:   512</strong><br />
</span>Filesystem created:       Tue Jul  6 21:42:32 2010<br />
Last mount time:          Tue Jul  6 21:44:42 2010<br />
Last write time:          Tue Jul  6 21:44:42 2010<br />
Mount count:              1<br />
Maximum mount count:      39<br />
Last checked:             Tue Jul  6 21:42:32 2010<br />
Check interval:           15552000 (6 months)<br />
Next check after:         Sun Jan  2 21:42:32 2011<br />
Reserved blocks uid:      0 (user root)<br />
Reserved blocks gid:      0 (group root)<br />
First inode:              11<br />
<strong><span style="color: #ff0000;">Inode size:               256</span></strong><br />
Required extra isize:     28<br />
Desired extra isize:      28<br />
Journal inode:            8<br />
Default directory hash:   half_md4<br />
Directory Hash Seed:      3ef6ca72-c800-4c44-8c77-532a21bcad5a<br />
Journal backup:           inode blocks<br />
Journal features:         (none)<br />
<strong><span style="color: #ff0000;">Journal size:             128M<br />
</span></strong>Journal length:           32768<br />
Journal sequence:         0x00000001<br />
Journal start:            0</p></blockquote>
<blockquote><p>Group 0: (Blocks 0-32767)<br />
Primary superblock at 0, Group descriptors at 1-5<br />
<strong><span style="color: #ff0000;">Reserved GDT blocks at 6-1024</span></strong><br />
Block bitmap at 1025 (+1025), Inode bitmap at 1026 (+1026)<br />
<strong><span style="color: #ff0000;">Inode table at 1027-1538 (+1027)</span></strong><br />
31223 free blocks, 8181 free inodes, 2 directories<br />
Free blocks: 1545-32767<br />
Free inodes: 12-8192</p>
<p>[snip ....]</p></blockquote>
<p>The file system block size is 4KB, which is different from the 1K block unit in the output of df and fdisk. In the above output, I mark the key lines in <strong><span style="color: #ff0000;">RED</span></strong>. Now let&#8217;s look at the line for reserved blocks,</p>
<blockquote><p><strong><span style="color: #ff0000;">Reserved block count:     983178</span></strong></p></blockquote>
<p>These 983178 4K blocks are reserved for the root user. Since the system and user home directories are not on the SSD, we don&#8217;t need to reserve these blocks. According to mkfs.ext3(8), the &#8216;-m&#8217; parameter sets reserved-blocks-percentage; use &#8216;-m 0&#8217; to reserve zero blocks for the privileged user.</p>
<p>From the file system features line, we can see that resize_inode is one of the features enabled by default,</p>
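<p>As a quick cross-check, the reserve can be compared with the blocks that df does not account for. This is only a small sketch using the numbers quoted in this post; the variable names are made up for illustration.</p>
<pre><code class="language-python">
# Numbers copied from the df and dumpe2fs output shown in this post.
df_total_1k = 77418272         # df "1K-blocks"
df_used_1k = 184216            # df "Used"
df_available_1k = 73301344     # df "Available"
fs_block_count = 19663560      # dumpe2fs "Block count" (4KB blocks)
reserved_block_count = 983178  # dumpe2fs "Reserved block count"
kb_per_fs_block = 4            # dumpe2fs "Block size: 4096"

# Blocks that are neither "Used" nor "Available" in the df output.
missing_1k = df_total_1k - (df_used_1k + df_available_1k)

# The root reserve, converted from 4KB file system blocks to 1K blocks.
reserved_1k = reserved_block_count * kb_per_fs_block
reserved_percent = 100 * reserved_block_count / fs_block_count

print("1K-blocks df does not account for:", missing_1k)
print("1K-blocks reserved for root:      ", reserved_1k)
print("reserved share of the file system: {:.2f}%".format(reserved_percent))
</code></pre>
<p>Both figures come out equal, so the blocks that seem to disappear in df are the root reserve, which here is the default 5% of the file system.</p>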
<blockquote><p>Filesystem features:      has_journal ext_attr <span style="color: #ff0000;"><strong>resize_inode</strong></span> dir_index filetype needs_recovery sparse_super large_file</p></blockquote>
<p>The resize_inode feature reserves quite a lot of blocks for new block group descriptors; these blocks can be found in lines like,</p>
<blockquote><p><strong><span style="color: #ff0000;">Reserved GDT blocks at 6-1024</span></strong></p></blockquote>
<p>When the resize_inode feature is enabled, mkfs.ext3 reserves some blocks after the block group descriptor blocks, called &#8220;Reserved GDT blocks&#8221;.  If the file system is extended in future (e.g. the file system is created on a logical volume), these reserved blocks can be used for new block group descriptors. Here the storage media is an SSD and there will be no file system extension in future, so we don&#8217;t have to pay money (on SSD, blocks mean money) for this kind of blocks. To disable the resize_inode feature, use &#8220;-O ^resize_inode&#8221; in mkfs.ext3(8).</p>
<p>Then look at these 2 lines for inode blocks,</p>
<blockquote><p><strong><span style="color: #ff0000;">Inodes per group:         8192<br />
Inode blocks per group:   512</span></strong><span style="color: #ff0000;"><br />
</span></p></blockquote>
<p>We store no more than 5 files on the whole file system, but here 512 blocks in each block group are allocated for the inode table. There are 601 block groups, which means 512&#215;601=307712 blocks (≈ 1.2GB of space) are wasted on inode tables. Use &#8216;-N 16&#8217; in mkfs.ext3(8) to specify only 16 inodes in the file system. mkfs.ext3(8) still allocates at least one inode table block in each block group (so there are more than 16 inodes), but now we only waste 1 block rather than 512 blocks per block group for the inode table.</p>
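<p>The block group count and the inode table cost can be re-derived from the dumpe2fs fields. A minimal sketch, with illustrative variable names:</p>
<pre><code class="language-python">
# Fields copied from the dumpe2fs output above.
fs_block_count = 19663560       # "Block count"
blocks_per_group = 32768        # "Blocks per group"
inode_blocks_per_group = 512    # "Inode blocks per group"
block_size = 4096               # "Block size"

# Round up: the last, partly filled block group still counts as a group.
group_count = -(-fs_block_count // blocks_per_group)

inode_table_blocks = inode_blocks_per_group * group_count
inode_table_gib = inode_table_blocks * block_size / 2**30

# With one inode table block left per group, this much remains allocated.
remaining_mib = group_count * block_size / 2**20

print("block groups:", group_count)
print("inode table blocks:", inode_table_blocks)
print("inode table space: {:.2f} GiB".format(inode_table_gib))
print("left after shrinking to 1 block per group: {:.2f} MiB".format(remaining_mib))
</code></pre>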
<blockquote><p><strong><span style="color: #ff0000;">Journal size:             128M</span><br />
</strong></p></blockquote>
<p>If most of the I/O is reads, write performance is ignored, and people really care about space usage, the journal area can be reduced to the minimum size (1024 file system blocks). For Ext3 with 4KB blocks, that is 4MB: -J size=4 (the size is given in megabytes)</p>
<p>With the above efforts, more than 4GB of space is back in use. If you really care about the space usage efficiency of your SSD, how about making the file system with:</p>
<blockquote><p>mkfs.ext3 -J size=4 -m 0 -O ^resize_inode -N 16  &lt;device&gt;</p></blockquote>
<p>Then you have a chance to put more data blocks to use on your expensive SSD <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
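<p>For a rough idea of the total, the four savings can be added up. This is only an estimate under a few assumptions of mine: one inode table block stays in every block group, the journal shrinks from 32768 to 1024 blocks, and the reserved GDT blocks exist in every group that holds a superblock copy (with sparse_super these are groups 0, 1 and the powers of 3, 5 and 7). The helper name is made up for illustration.</p>
<pre><code class="language-python">
# Fields copied from the dumpe2fs output above.
fs_block_count = 19663560
blocks_per_group = 32768
block_size = 4096
group_count = -(-fs_block_count // blocks_per_group)


def sparse_super_backup_groups(count):
    """Groups holding a superblock copy: 0, 1 and powers of 3, 5 and 7."""
    groups = {0, 1}
    for base in (3, 5, 7):
        power = base
        while power in range(count):
            groups.add(power)
            power *= base
    return sorted(groups.intersection(range(count)))


root_reserve_saved = 983178                  # "Reserved block count"
inode_table_saved = (512 - 1) * group_count  # keep 1 of 512 blocks per group
journal_saved = 32768 - 1024                 # "Journal length" minus the minimum
gdt_saved = 1019 * len(sparse_super_backup_groups(group_count))

reclaimed_blocks = (root_reserve_saved + inode_table_saved
                    + journal_saved + gdt_saved)
reclaimed_gib = reclaimed_blocks * block_size / 2**30

print("reclaimed 4KB blocks:", reclaimed_blocks)
print("reclaimed space: {:.2f} GiB".format(reclaimed_gib))
</code></pre>
<p>Under these assumptions the total comes to a little over 5 GiB, most of it from the root reserve and the inode tables.</p>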
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=124</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Taobao joins open source</title>
		<link>http://blog.coly.li/?p=112</link>
		<comments>http://blog.coly.li/?p=112#comments</comments>
		<pubDate>Wed, 30 Jun 2010 16:27:57 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[Great Days]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=112</guid>
		<description><![CDATA[Today, Taobao announces its open source community  &#8212; http://code.taobao.org. This is a historical day, a China local  internet and e-business leading company joins open source world by its practice approved activity. The first project released on code.taobao.org is TAIR. Tair is a distributed, high performance key/value storage system, using in Taobao&#8217;s infrastructure for time.  Taobao [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://code.taobao.org"><img class="alignnone size-thumbnail wp-image-115" title="Taobao Code" src="http://blog.coly.li/wp-content/uploads/2010/06/logo_en-150x40.gif" alt="Taobao's open source commnity" width="150" height="40" /></a></p>
<p><a href="http://code.taobao.org"><img class="alignnone size-thumbnail wp-image-116" title="Taobao Open Source" src="http://blog.coly.li/wp-content/uploads/2010/06/logo_ch-150x40.gif" alt="Taobao's open source community" width="150" height="40" /></a></p>
<p>Today, Taobao announces its open source community  &#8212; http://code.taobao.org.</p>
<p>This is a historic day: a leading local Chinese internet and e-business company joins the open source world with activity proven in practice.</p>
<p>The first project released on code.taobao.org is TAIR. Tair is a distributed, high performance key/value storage system, which has been used in Taobao&#8217;s infrastructure for some time.  Taobao is on the way to open sourcing more internal projects. Yes, talk is cheap, show the code!</p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/06/first_page1.jpg"><img class="alignnone size-full wp-image-122" title="first page on code.taobao.org" src="http://blog.coly.li/wp-content/uploads/2010/06/first_page1.jpg" alt="" width="800" height="526" /></a></p>
<p>If you are working on a large scale website with more than 10K server nodes, checking the projects on code.taobao.org may help you avoid reinventing the wheel. Please visit http://code.taobao.org and join the community to contribute. I believe people can make the community better and better. Currently, most of the expected developers are Chinese speakers, which is why you can find square characters on the website. I believe more changes will come in future, because the people behind the community like continuous improvement <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Of course, some other contributions to the open source community from Taobao cannot be found on code.taobao.org. For example, I believe patches from Taobao will appear in the Linux kernel changelog very soon <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=112</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Random I/O &#8212; Is raw device always faster than file system ?</title>
		<link>http://blog.coly.li/?p=87</link>
		<comments>http://blog.coly.li/?p=87#comments</comments>
		<pubDate>Sun, 27 Jun 2010 14:53:27 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[File System Magic]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=87</guid>
		<description><![CDATA[For some implementations of distributed file systems, like TFS [1], developers think storing data on raw device directly (e.g. /dev/sdb, /dev/sdc&#8230;) might be faster than on file systems. Their choice is reasonable, 1, Random I/O on large file cannot get any help from file system page cache. 2, &#60;logical offset, physical offset&#62; mapping introduces more [...]]]></description>
				<content:encoded><![CDATA[<p>For some implementations of distributed file systems, like TFS [1], developers think storing data directly on raw devices (e.g. /dev/sdb, /dev/sdc&#8230;) might be faster than on file systems.</p>
<p>Their choice is reasonable,</p>
<blockquote><p>1, Random I/O on a large file cannot get any help from the file system page cache.</p>
<p>2, &lt;logical offset, physical offset&gt; mapping introduces more I/O on file systems than on a raw disk.</p>
<p>3, Managing metadata on other powerful servers avoids the need to use file systems on data nodes.</p></blockquote>
<p>The penalty for the &#8220;higher&#8221; performance is management cost. Storing data on raw devices introduces difficulties like,</p>
<blockquote><p>1, Harder to backup/restore the data.</p>
<p>2, Cannot do more flexible management without special management tools for the raw device.</p>
<p>3, No convenient method to access/manage the data on the raw device.</p></blockquote>
<p>The above penalties are hard for system administrators to ignore. Furthermore, the story of &#8220;higher&#8221; performance is not exactly true today,</p>
<blockquote><p>1, For file systems using block pointers for &lt;logical offset, physical offset&gt; mapping, a large file takes too many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K+  pointer blocks. Most of the pointer blocks are cold under random I/O, which results in lower random I/O performance than on a raw device.</p>
<p>2, For file systems using extents for &lt;logical offset, physical offset&gt; mapping, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4 with a max block group size of 128MB, a 2TB file has around 16384 fragments. To map these 16K fragments, 16K extent records are needed, which can be placed in 50+ extent blocks. It&#8217;s very easy to hit a hot extent in memory for random I/O on a large file.</p>
<p>3, If the &lt;logical offset, physical offset&gt; mapping can be kept hot in memory, random I/O performance on a file system might not be worse than on a raw device.</p></blockquote>
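<p>The block counts in points 1 and 2 can be reproduced with a little arithmetic. The sketch below assumes the classic Ext3 layout (12 direct pointers, then single, double and triple indirect trees with 4-byte pointers) and the Ext4 extent format (12-byte header and 12-byte entries per 4KB block), with one extent per 128MB fragment as in the example above. The function and variable names are made up for illustration.</p>
<pre><code class="language-python">
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


BLOCK = 4096
FILE_SIZE = 2 * 2**40                 # 2TB
data_blocks = FILE_SIZE // BLOCK

# Ext3: 12 direct pointers in the inode, then three indirect trees.
ptrs = BLOCK // 4                     # 1024 pointers per 4KB block
remaining = data_blocks - 12
single = min(remaining, ptrs)
remaining -= single
double = min(remaining, ptrs ** 2)
remaining -= double
triple = remaining

single_blocks = 1 if single else 0
double_leaves = ceil_div(double, ptrs)
double_blocks = double_leaves + (1 if double else 0)
triple_leaves = ceil_div(triple, ptrs)
triple_blocks = (triple_leaves + ceil_div(triple_leaves, ptrs)
                 + (1 if triple else 0))
ext3_pointer_blocks = single_blocks + double_blocks + triple_blocks

# Ext4: one extent per 128MB fragment, packed into 4KB extent blocks.
extents = ceil_div(FILE_SIZE, 128 * 2**20)
entries_per_block = (BLOCK - 12) // 12          # 340 entries per block
ext4_leaf_blocks = ceil_div(extents, entries_per_block)
# Assumption: the leaves do not fit the 4 slots in the inode, so index
# blocks sit between the inode and the leaf blocks.
ext4_index_blocks = ceil_div(ext4_leaf_blocks, entries_per_block)
ext4_extent_blocks = ext4_leaf_blocks + ext4_index_blocks

print("Ext3 pointer blocks:", ext3_pointer_blocks)
print("Ext3 mapping size: {:.2f} GiB".format(ext3_pointer_blocks * BLOCK / 2**30))
print("Ext4 extent blocks:", ext4_extent_blocks)
print("Ext4 mapping size: {:.0f} KiB".format(ext4_extent_blocks * BLOCK / 1024))
</code></pre>
<p>With these assumptions the Ext3 mapping needs about 2 GiB of pointer blocks, while the Ext4 mapping fits in roughly 200 KiB, which is why the latter easily stays hot in memory.</p>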
<p>In order to verify my guess, I did some performance testing.  I share part of the data here.</p>
<blockquote><p>Processor: AMD opteron 6174 (2.2 GHz) x 2</p>
<p>Memory: DDR3 1333MHz 4GB x 4</p>
<p>Hard disk: 5400RPM SATA 2TB x 3 [2]</p>
<p>File size: (created by dd, almost) 2TB</p>
<p>Random I/O access: 100K reads</p>
<p>IO size: 512 bytes</p>
<p>File systems: Ext3, Ext4 (with and without directio)</p>
<p>test tool: <a href="http://www.mlxos.org/misc/seekrw.c" target="_blank">seekrw</a> [3]</p></blockquote>
<p>* With page cache</p>
<blockquote><p>- Command</p>
<p>seekrw -f /mnt/ext3/img -a 100000 -l 512 -r</p>
<p>seekrw -f /mnt/ext4/img -a 100000 -l 512 -r</p>
<p>- Performance result</p>
<table border="0" cellspacing="0" frame="VOID" rules="NONE">
<colgroup>
<col width="64"></col>
<col width="75"></col>
<col width="72"></col>
<col width="86"></col>
<col width="86"></col>
<col width="86"></col>
<col width="86"></col>
</colgroup>
<tbody>
<tr>
<td width="64" height="17" align="LEFT"><span style="font-family: Liberation Serif;"><br />
</span></td>
<td width="75" align="RIGHT"><span style="font-family: Liberation Serif;">Device</span></td>
<td width="72" align="RIGHT"><span style="font-family: Liberation Serif;">tps</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_read/s</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_wrtn/s</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_read</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_wrtn</span></td>
</tr>
<tr>
<td height="17" align="LEFT"><span style="font-family: Liberation Serif;">Ext3</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">sdc</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">95.88</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">767.07</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">46024</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
</tr>
<tr>
<td height="17" align="LEFT"><span style="font-family: Liberation Serif;">Ext4</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">sdd</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">60.72</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">485.6</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">29136</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
</tr>
</tbody>
</table>
<p>- Wall clock time</p>
<p>Ext3: real time: 34 minutes 23 seconds 557537 usec</p>
<p>Ext4: real time: 24 minutes 44 seconds 10118 usec</p></blockquote>
<p>* directio (without pagecache)</p>
<blockquote><p>- Command</p>
<p>seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d</p>
<p>seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d</p>
<p>- Performance result</p>
<table border="0" cellspacing="0" frame="VOID" rules="NONE">
<colgroup>
<col width="64"></col>
<col width="75"></col>
<col width="72"></col>
<col width="86"></col>
<col width="86"></col>
<col width="86"></col>
<col width="86"></col>
</colgroup>
<tbody>
<tr>
<td width="64" height="17" align="LEFT"><span style="font-family: Liberation Serif;"><br />
</span></td>
<td width="75" align="RIGHT"><span style="font-family: Liberation Serif;">Device</span></td>
<td width="72" align="RIGHT"><span style="font-family: Liberation Serif;">tps</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_read/s</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_wrtn/s</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_read</span></td>
<td width="86" align="RIGHT"><span style="font-family: Liberation Serif;">Blk_wrtn</span></td>
</tr>
<tr>
<td height="17" align="LEFT"><span style="font-family: Liberation Serif;">Ext3</span></td>
<td align="RIGHT">sdc</td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">94.93</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">415.77</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">12473</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
</tr>
<tr>
<td height="17" align="LEFT"><span style="font-family: Liberation Serif;">Ext4</span></td>
<td align="RIGHT">sdd</td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">67.9</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">67.9</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">2037</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">0</span></td>
</tr>
<tr>
<td height="17" align="LEFT">Raw</td>
<td align="RIGHT">sdf</td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">67.27</span></td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">538.13</span></td>
<td align="RIGHT">0</td>
<td align="RIGHT"><span style="font-family: Liberation Serif;">16144</span></td>
<td align="RIGHT">0</td>
</tr>
</tbody>
</table>
<p>- Wall clock time</p>
<p>Ext3: real time: 33 minutes 26 seconds 947875 usec</p>
<p>Ext4: real time: 24 minutes 25 seconds 545536 usec</p>
<p>sdf: real time: 24 minutes 38 seconds 523379 usec    (raw device)</p></blockquote>
<p>From the above performance numbers, Ext4 is roughly 37% to 39% faster than Ext3 on random I/O, with or without the page cache. This is expected.</p>
<p>The results of random I/O on Ext4 and on the raw device are almost the same. This is also as expected. For file systems mapping &lt;logical offset, physical offset&gt; by extent, it&#8217;s quite easy to keep most of the mapping records hot in memory. Random I/O on the raw device has *NO* obvious performance advantage over Ext4.</p>
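<p>The ratios can be recomputed from the wall clock times listed above. A small sketch, with a made-up helper name, that only uses the times reported in this post:</p>
<pre><code class="language-python">
def to_seconds(minutes, seconds, microseconds):
    """Convert a 'real time' line from the results above to seconds."""
    return minutes * 60 + seconds + microseconds / 1_000_000


READS = 100_000  # number of random reads per run (-a 100000)

ext3_cached = to_seconds(34, 23, 557537)
ext4_cached = to_seconds(24, 44, 10118)
ext3_direct = to_seconds(33, 26, 947875)
ext4_direct = to_seconds(24, 25, 545536)
raw_direct = to_seconds(24, 38, 523379)

# How much longer Ext3 needs than Ext4 for the same number of reads.
cached_ratio = ext3_cached / ext4_cached
direct_ratio = ext3_direct / ext4_direct
# How the raw device compares with Ext4 using direct I/O.
raw_ratio = raw_direct / ext4_direct

print("Ext3 / Ext4 with page cache: {:.2f}".format(cached_ratio))
print("Ext3 / Ext4 with directio:   {:.2f}".format(direct_ratio))
print("raw / Ext4 with directio:    {:.3f}".format(raw_ratio))
print("Ext4 directio reads per second: {:.1f}".format(READS / ext4_direct))
</code></pre>
<p>Ext3 needs about 39% more time than Ext4 with the page cache and about 37% more with directio, while the raw device and Ext4 are within 1% of each other.</p>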
<p>Dear developers, how about considering extent based file systems now <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>&#8212;</p>
<p>[1] TFS, TaobaoFS. A distributed file system deployed for http://www.taobao.com . It is developed by the core system team of Taobao and will be open sourced very soon.</p>
<p>[2] The hard disk is connected to the system through a RocketRAID 644 card via an eSATA connector.</p>
<p>[3] seekrw source code can be downloaded from http://www.mlxos.org/misc/seekrw.c</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=87</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>a conversation on DLM lock levels used in OCFS2</title>
		<link>http://blog.coly.li/?p=81</link>
		<comments>http://blog.coly.li/?p=81#comments</comments>
		<pubDate>Fri, 30 Apr 2010 15:56:23 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[Basic Knowledge]]></category>
		<category><![CDATA[File System Magic]]></category>
		<category><![CDATA[ocfs2]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=81</guid>
		<description><![CDATA[Recently, I had a conversation with Mark Fasheh about the DLM (Distributed Lock Manager) lock levels used in OCFS2 (Oracle Cluster File System v2). IMHO, the talk is quite useful for anyone starting with OCFS2 or DLM, so I list the conversation here and hope it is informative. Thank you, Mark. Mark gave a simplified explanation [...]]]></description>
				<content:encoded><![CDATA[<p>Recently, I had a conversation with Mark Fasheh about the DLM (Distributed Lock Manager) lock levels used in OCFS2 (Oracle Cluster File System v2). IMHO, the talk is quite useful for anyone starting with OCFS2 or DLM, so I list the conversation here and hope it is informative. Thank you, Mark <img src='http://blog.coly.li/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>Mark gave a simplified explanation of the NL, PR and EX DLM lock levels used in OCFS2.</p>
<blockquote><p>There are 3 lock levels Ocfs2 uses when protecting shared resources.</p>
<p>&#8220;NL&#8221; aka &#8220;No Lock&#8221; this is used as a placeholder. Either we get it so that we<br />
can convert the lock to something useful, or we already had some higher level<br />
lock and dropped to NL so another node can continue. This lock level does not<br />
block any other nodes from access to the resource.</p>
<p>&#8220;PR&#8221; aka &#8220;Protected Read&#8221;. This is used so that multiple nodes might read the<br />
resource at the same time without any mutual exclusion. This level blocks only<br />
those nodes which want to make changes to the resource (EX locks).</p>
<p>&#8220;EX&#8221; aka &#8220;Exclusive&#8221;. This is used to keep other nodes from reading or changing<br />
a resource while it is being changed by the current node. This level blocks PR<br />
locks and other EX locks.</p>
<p>When another node wants a level of access to a resource which the current node<br />
is blocking due to its lock level, that node &#8220;downconverts&#8221; the lock to a<br />
compatible level. Sometimes we might have multiple nodes trying to gain<br />
exclusive access to a resource at the same time (say two nodes want to go from<br />
PR -&gt; EX). When that happens, only one node can win and the others are sent<br />
signals to &#8216;cancel&#8217; their lock request and if need be, &#8216;downconvert&#8217; to a mode<br />
which is compatible with what&#8217;s being requested. In the previous example, that<br />
means one of the nodes would cancel its attempt to go from PR-&gt;EX and<br />
afterwards it would drop its PR to NL since the PR lock blocks the other node<br />
from an EX.</p></blockquote>
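<p>The three levels above can be written down as a small compatibility table. The following is only a minimal sketch of the rules as Mark describes them, not code from dlmglue.c or any DLM: the names <code>BLOCKS</code>, <code>is_compatible</code> and <code>downconvert_target</code> are hypothetical, and picking the highest level that is still compatible as the downconvert target is my assumption.</p>

```python
# Lock levels from the quoted explanation, ordered weakest to strongest.
NL, PR, EX = "NL", "PR", "EX"
LEVELS = [NL, PR, EX]

# BLOCKS[held] is the set of levels that a lock held at `held` blocks.
BLOCKS = {
    NL: set(),       # NL is a placeholder and blocks nobody
    PR: {EX},        # PR blocks only nodes that want to change the resource
    EX: {PR, EX},    # EX blocks readers and other writers
}

def is_compatible(held, requested):
    """True if another node may get `requested` while we hold `held`."""
    return requested not in BLOCKS[held]

def downconvert_target(held, requested):
    """Highest level at or below `held` that no longer blocks `requested`.

    Assumption: the holder keeps as much of its lock as it can.
    """
    at_or_below = LEVELS[: LEVELS.index(held) + 1]
    candidates = [lvl for lvl in at_or_below if is_compatible(lvl, requested)]
    return candidates[-1]   # NL is always compatible, so this is never empty

# The example from the text: a node holding PR blocks another node's EX
# request, so after cancelling its own PR -> EX attempt it drops to NL.
print(downconvert_target(PR, EX))
```

<p>With this table, two PR holders are compatible with each other, while a PR holder that stands in the way of an EX request has nothing left to fall back to but NL.</p>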
<p>After reading the above text, I talked with Mark on IRC. Here is the edited conversation log (unnecessary parts removed):</p>
<blockquote><p>coly: it&#8217;s excellent material on the DLM lock levels of ocfs2!<br />
mark: especially if that helps folks understand what&#8217;s happening in dlmglue.c<br />
* mark knows that code can be&#8230;. hard to follow  <img src='http://blog.coly.li/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /><br />
mark: another thing you might want to take note of &#8211; this whole &#8220;cancel convert&#8221; business is there because the dlm allows a process to retain its current lock level while asking for an escalation<br />
coly: one thing I am not clear is, what&#8217;s the functionality of dlmglue.c ? like the name, glue ?<br />
mark: if you think about it &#8211; being forced to drop the lock and re-acquire would eliminate the possibility of deadlock, at the expense of performance<br />
mark: think of dlmglue.c as the layer of code which abstracts away the dlm interface for the fs<br />
mark: as part of that abstraction, file system lock management is wholly contained within dlmglue.c<br />
coly: only dlmglue.c acts as an abstraction layer ?  and the real job is done by fsdlm or o2dlm ?<br />
mark: yes<br />
mark: dlmglue is never actually creating resources itself &#8211; it&#8217;s asking the dlm on behalf of the file system<br />
mark: aside from code cleanliness, dlmglue provides a number of features the fs needs that the dlm (rightfully) does not provide<br />
coly: which kind of ?<br />
mark: lock caching for example &#8211; you&#8217;ll notice that we keep counts on the locks in dlmglue<br />
mark: also, whatever fs specific actions might be needed as part of a lock transition are initiated from dlmglue. an example of that would be checkpointing inode changes before allowing other nodes access, etc<br />
coly: yeah, that&#8217;s one more thing confusing me.<br />
coly:  It&#8217;s not clear to me yet, the concept of upconvert and downconvert<br />
coly: when it is combined with ast and bast<br />
mark: have you checked out the &#8220;dlmbook&#8221; pdf? it explains the dlm api (which once you understand, makes dlmglue a lot easier to figure out)<br />
coly: yes, I read it. but because I didn&#8217;t know ast and bast before, I don&#8217;t have a clear idea of what happens in ast and bast<br />
coly: is it something like the signal handler ?<br />
mark: ast and bast though are just callbacks we pass to the dlm. one (ast) is used to tell fs that a request is complete, the other (bast) is used to tell fs that a lock is blocking progress from another node<br />
coly: when an ast is triggered, what will happen ? the node received the ast can make sure the requested lock level is granted ?<br />
mark: generally yes. the procedure is: dlmglue fires off a request&#8230; some time later, the ast callback is run and the status it passes to dlmglue indicates whether the operation succeeded<br />
coly: if a node receives a bast, what will happen ? I mean, are there options (e.g. release its lock, or ignore the bast) ?<br />
mark: release the lock once possible<br />
mark: that&#8217;s the only action that doesn&#8217;t lockup the cluster  <img src='http://blog.coly.li/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /><br />
coly: I see, once a node receives a bast, it should try its best to downconvert the corresponding lock to NL.<br />
coly: it&#8217;s a little bit clearer to me now <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p></blockquote>
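<p>To make the ast and bast flow from the log a bit more concrete, here is a toy single-process model. It is a sketch under heavy assumptions, not the o2dlm or fsdlm API: the class <code>ToyDLM</code> and its methods are hypothetical, the ast here only reports a grant (a real ast also carries a status), and the bast callback just records the event so that the node can downconvert later, &#8220;once possible&#8221;.</p>

```python
# Toy model of the ast/bast flow; every name here is hypothetical.
BLOCKS = {"NL": set(), "PR": {"EX"}, "EX": {"PR", "EX"}}

class ToyDLM:
    def __init__(self):
        self.held = {}      # node name -> level currently granted
        self.bast = {}      # node name -> blocking callback
        self.waiting = []   # pending (node, level, ast) requests

    def request(self, node, level, ast, bast):
        """Ask for `level`; the node keeps its current level while waiting."""
        self.bast[node] = bast
        self.waiting.append((node, level, ast))
        self._run()

    def downconvert(self, node, level):
        """Called by a node, typically some time after it received a bast."""
        self.held[node] = level
        self._run()

    def _run(self):
        pending, self.waiting = self.waiting, []
        for node, level, ast in pending:
            blockers = [n for n, held in self.held.items()
                        if n != node and level in BLOCKS[held]]
            if blockers:
                # Still blocked: keep waiting and tell each blocker via bast.
                self.waiting.append((node, level, ast))
                for n in blockers:
                    self.bast[n](n, level)
            else:
                # Request complete: grant it and report through the ast.
                self.held[node] = level
                ast(node, level)

events = []

def ast(node, level):
    events.append(("ast", node, level))

def bast(node, wanted):
    events.append(("bast", node, wanted))

dlm = ToyDLM()
dlm.request("A", "PR", ast, bast)
dlm.request("B", "PR", ast, bast)
dlm.request("B", "EX", ast, bast)   # blocked by A's PR, so A gets a bast
dlm.downconvert("A", "NL")          # A releases, then B's ast fires
print(events)
```

<p>In this run node A receives a bast for B&#8217;s EX request, and B&#8217;s ast for EX only fires after A has dropped its PR to NL.</p>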
<p>I quote the log here rather than paraphrasing it with my own understanding. It can be helpful for getting the basic concepts of OCFS2&#8217;s DLM lock levels and what ast and bast do.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=81</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>2010 first snow in Beijing</title>
		<link>http://blog.coly.li/?p=60</link>
		<comments>http://blog.coly.li/?p=60#comments</comments>
		<pubDate>Mon, 04 Jan 2010 08:03:09 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[Great Days]]></category>
		<category><![CDATA[nature]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=60</guid>
		<description><![CDATA[Yesterday, the first snow of 2010 visited Beijing. I stayed at home till midnight, then went out to take some photos. The air was so cold, and I walked in the freezing wind for 1.5 hours.  It was fun to see the snow covering everything, especially the houses, cars, and plants. Several fat cats appeared on [...]]]></description>
				<content:encoded><![CDATA[<p>Yesterday, the first snow of 2010 visited Beijing. I stayed at home till midnight, then went out to take some photos.</p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow1.jpg"><img class="alignnone size-full wp-image-70" title="2010firstsnow1" src="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow1.jpg" alt="" width="600" height="450" /></a></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow2.jpg"><img class="alignnone size-full wp-image-71" title="2010firstsnow2" src="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow2.jpg" alt="" width="600" height="450" /></a></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow3.jpg"><img class="alignnone size-full wp-image-72" title="2010firstsnow3" src="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow3.jpg" alt="" width="600" height="450" /></a></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow4.jpg"><img class="alignnone size-full wp-image-73" title="2010firstsnow4" src="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow4.jpg" alt="" width="600" height="450" /></a></p>
<p><a href="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow5.jpg"><img class="alignnone size-full wp-image-74" title="2010firstsnow5" src="http://blog.coly.li/wp-content/uploads/2010/01/2010firstsnow5.jpg" alt="" width="600" height="450" /></a></p>
<p>The air was so cold, and I walked in the freezing wind for 1.5 hours.  It was fun to see the snow covering everything, especially the houses, cars, and plants. Several fat cats appeared on my way without even glancing at me. I guess they were looking for some warm place to stay; I hope they felt comfortable last night and are still okay this morning. Cats are probably stronger than me: after last night&#8217;s walk, even though I then stayed in a warm room, I am afraid I&#8217;ve caught a chill <img src='http://blog.coly.li/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>
<p>In China, a big snow at the beginning of the year is a perfect sign.  Maybe 2010 will be another exciting and impressive year if we are more diligent and optimistic. Who knows? <img src='http://blog.coly.li/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=60</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Please help a Linux programmer&#8217;s daughter, she is dying</title>
		<link>http://blog.coly.li/?p=44</link>
		<comments>http://blog.coly.li/?p=44#comments</comments>
		<pubDate>Fri, 01 Jan 2010 08:49:06 +0000</pubDate>
		<dc:creator>colyli</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.coly.li/?p=44</guid>
		<description><![CDATA[The daughter of Junting Pan, an excellent Linux programmer in Beijing and a friend of mine, is dying. I am here to ask more people to help save the lovely life of a little girl. Yifan, the 5-year-old daughter of Pan, has suffered from a serious lung disease for the past few years, and she almost died on Nov 11, 2009 (http://help-yifan.org/img/notice.jpg).  [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.help-yifan.org"><img class="alignnone size-full wp-image-46" title="help-yifan" src="http://blog.coly.li/wp-content/uploads/2010/01/help-yifan.jpg" alt="" width="500" height="496" /></a></p>
<p>The daughter of Junting Pan, an excellent Linux programmer in Beijing and a friend of mine, is dying. I am here to ask more people to help save the lovely life of a little girl.</p>
<p>Yifan, the 5-year-old daughter of Pan, has suffered from a serious lung disease for the past few years, and she almost died on Nov 11, 2009 (http://help-yifan.org/img/notice.jpg).  In order to save her life, her parents must send their little girl to see some of the best specialists in the world, which means a large amount of money (US$300K~$500K).  This is an impossible sum for a software engineer (especially in a developing country).</p>
<p><a href="http://www.yifanfund.com"><img class="alignnone size-full wp-image-47" title="yifan" src="http://blog.coly.li/wp-content/uploads/2010/01/yifan.jpg" alt="" width="300" height="200" /></a></p>
<p>Yesterday, the last day of 2009, I visited Yifan&#8217;s family in Beijing. Her parents have sold their only house to pay for the treatment, and now the whole family lives together in a small room.  Yifan&#8217;s mother and father were brave and strong-minded; we talked about Yifan&#8217;s physical condition and the current donation amount. The great news was that, by Dec 31, 2009, help-yifan.org had received 314K RMB in donations (most of it from mainland China), which was almost 1/10 of the expected amount. Yifan said hello to me and looked at me with a sweet smile. She looked like a small flower waiting for the beautiful sunshine of her life. What a great miracle it would be if she could have a blissful tomorrow, and what a pity if she had to leave us because of the lung disease.</p>
<p>Last week, I received the remuneration for the Chinese translation of &#8220;Linkers and Loaders&#8221; and donated it to little Yifan. I hope it helps, but in order to save Yifan&#8217;s life, the family needs more help from more people around the world. If you read this blog, please do not hesitate to tell Yifan&#8217;s story to your friends.</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="450" height="363" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="align" value="middle" /><param name="allowScriptAccess" value="always" /><param name="allowFullScreen" value="true" /><param name="quality" value="high" /><param name="wmode" value="transparent" /><param name="src" value="http://www.tudou.com/player/outside/beta_player.swf?iid=43545918" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="450" height="363" src="http://www.tudou.com/player/outside/beta_player.swf?iid=43545918" wmode="transparent" quality="high" allowfullscreen="true" allowscriptaccess="always" align="middle"></embed></object></p>
<p>If you want to help Yifan, please visit <a href="http://www.help-yifan.org/" target="_blank">http://www.help-yifan.org</a> (Chinese) or <a href="http://www.yifanfund.com" target="_blank">http://www.yifanfund.com</a> (English). Donations and volunteering are both helpful. Today is the first day of a brand new year. I wish that Yifan will see many more new years in the future, and that people from all over the world can help little Yifan and make this miracle of life happen.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.coly.li/?feed=rss2&#038;p=44</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
