An enjoyed kernel apprentice

July 25, 2010

Don’t waste your SSD blocks

Filed under: File System Magic — colyli @ 11:20 pm

A few days ago, one of my colleagues asked me a question: he had formatted an ~80G Ext3 file system on an SSD. After mounting the file system, the df output was,

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1             77418272    184216  73301344   1% /mnt

And the fdisk output said,

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1             7834       17625    78654240   83  Linux

From his observation, before formatting the SSD there were 78654240 1K blocks available on the partition; after formatting, only 77418272 1K blocks could be used, which means more than 1GB of space on the partition went unused.

A more serious question: from the output of df, used blocks + available blocks = 73485560, but the file system had 77418272 blocks, so 3932712 1K blocks disappeared! This 160G SSD cost him 430USD, and he complained that around 15USD was paid for nothing.
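
The arithmetic is easy to double check in any shell:

# used + available, from the df output above
echo $((184216 + 73301344))     # => 73485560
# blocks the file system has, minus blocks df accounts for
echo $((77418272 - 73485560))   # => 3932712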

IMHO, this is quite an interesting question, and one that has been asked by many people many times. This time, I’d like to spend some time explaining how the blocks are wasted, and how to make better use of every block on the SSD (since it’s quite expensive).

First of all, better storage usage depends on the I/O pattern in practice. This SSD is used to store large files for random I/O; most of the I/O (99%+) is reading at random file offsets, and writing can almost be ignored. Therefore, the goal is to use every available block to store a few very big files on the Ext3 file system.

If only the default command line is used to format an Ext3 file system, like "mkfs.ext3 /dev/sdb1", mkfs.ext3 will do the following things for block allocation:

- Allocates reserved blocks for the root user, to avoid non-privileged users using up all the disk space.

- Allocates metadata: the superblock, backup superblocks, block group descriptors, a block bitmap for each block group, an inode bitmap for each block group, and an inode table for each block group.

- Allocates reserved group descriptor blocks for offline file system extension.

- Allocates blocks for the journal.

Since the SSD is only for data storage (no operating system installed on it), writing performance is disregarded, there is no requirement for future file system extension, and only a few files are stored on the file system, several of these block allocations are unnecessary and useless:

- Journal blocks

- Inode blocks

- Reserved group descriptor blocks for file system resize

- Reserved blocks for root user

Let’s run dumpe2fs to see how many blocks are wasted on the above items. I only list the important parts of the output here,

> dumpe2fs /dev/sdb1

Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          f335ba18-70cc-43f9-bdc8-ed0a8a1a5ad3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              4923392
Block count:              19663560
Reserved block count:     983178
Free blocks:              19308514
Free inodes:              4923381
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1019
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512

Filesystem created:       Tue Jul  6 21:42:32 2010
Last mount time:          Tue Jul  6 21:44:42 2010
Last write time:          Tue Jul  6 21:44:42 2010
Mount count:              1
Maximum mount count:      39
Last checked:             Tue Jul  6 21:42:32 2010
Check interval:           15552000 (6 months)
Next check after:         Sun Jan  2 21:42:32 2011
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      3ef6ca72-c800-4c44-8c77-532a21bcad5a
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00000001
Journal start:            0

Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-5
Reserved GDT blocks at 6-1024
Block bitmap at 1025 (+1025), Inode bitmap at 1026 (+1026)
Inode table at 1027-1538 (+1027)
31223 free blocks, 8181 free inodes, 2 directories
Free blocks: 1545-32767
Free inodes: 12-8192

[snip ....]

The file system block size is 4KB, which is different from the 1K block size reported by df and fdisk above. Now let’s look at the line for the reserved block count,

Reserved block count:     983178

These 983178 4KB blocks are reserved for the root user (the default is 5% of all blocks: 19663560 × 5% = 983178); note that they are exactly the 3932712 1K blocks (983178 × 4) that "disappeared" from the df output. Since neither the system nor the user home directories are on the SSD, we don’t need to reserve these blocks. Reading mkfs.ext3(8), there is a parameter ‘-m’ to set the reserved-blocks-percentage; use ‘-m 0’ to reserve zero blocks for the privileged user.
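
By the way, if the file system is already formatted, the same thing can be done in place with tune2fs(8), no reformat needed; a minimal sketch, reusing /dev/sdb1 from the example:

# drop the root-reserved percentage to zero on the existing file system
tune2fs -m 0 /dev/sdb1
# "Reserved block count" should now read 0
dumpe2fs -h /dev/sdb1 | grep "Reserved block count"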

From the file system features line, we can see resize_inode is one of the features enabled by default,

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

The resize_inode feature reserves quite a lot of blocks for new extended block group descriptors; these blocks can be found in lines like,

Reserved GDT blocks at 6-1024

When the resize_inode feature is enabled, mkfs.ext3 reserves some blocks after the block group descriptor blocks, called "Reserved GDT blocks". If the file system is extended in the future (e.g. when it is created on a logical volume), these reserved blocks can be used for new block group descriptors. Here the storage medium is an SSD and there will be no file system extension in the future, so we don’t have to pay money (on an SSD, blocks mean money) for this kind of block. To disable the resize_inode feature, use "-O ^resize_inode" with mkfs.ext3(8).
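
As a quick sanity check (a sketch only; it reformats /dev/sdb1, so any data on it is lost), the reserved GDT lines should disappear after formatting with the feature disabled:

# format without the resize_inode feature
mkfs.ext3 -O ^resize_inode /dev/sdb1
# no "Reserved GDT blocks at ..." lines should appear any more
dumpe2fs /dev/sdb1 | grep "Reserved GDT"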

Then look at these 2 lines for inode blocks,

Inodes per group:         8192
Inode blocks per group:   512

We store no more than 5 files on the whole file system, but here 512 blocks in each block group are allocated for the inode table. There are 601 block groups, which means 512×601=307712 blocks (≈ 1.2GB of space) are wasted on inode tables. Use ‘-N 16’ with mkfs.ext3(8) to ask for only 16 inodes in the file system; although mkfs.ext3 allocates at least one inode table block in each block group (so somewhat more than 16 inodes in total), we now waste only 1 block instead of 512 blocks per block group on the inode table.
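
The savings and the verification, as a rough sketch (again, mkfs.ext3 destroys whatever is on /dev/sdb1):

# 512 inode table blocks per group x 601 groups x 4KB per block
echo $((512 * 601 * 4 / 1024))MB      # => 1202MB wasted by default
# ask for only 16 inodes at format time
mkfs.ext3 -N 16 /dev/sdb1
# "Inode blocks per group" should drop from 512 to 1
dumpe2fs -h /dev/sdb1 | grep "Inode blocks per group"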

Journal size:             128M

If most of the I/O is reads, writing performance is ignored, and the space usage really matters, the journal area can be reduced to its minimum size of 1024 file system blocks; for a 4KB-block Ext3, that is 4MB: -J size=4M
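
For example (a sketch following the post’s own flag spelling; note that some mke2fs versions expect the journal size in plain megabytes, i.e. -J size=4):

# format with the minimum journal instead of the default 128M one
mkfs.ext3 -J size=4M /dev/sdb1
# "Journal size" should now read 4M instead of 128M
dumpe2fs -h /dev/sdb1 | grep "Journal size"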

With the above efforts, around 4GB+ of space comes back into use. If you really care about the space usage efficiency of your SSD, how about making the file system with:

mkfs.ext3 -J size=4M -m 0 -O ^resize_inode -N 16  <device>

Then you have a chance to put more data blocks to use on your expensive SSD :-)
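
To double check the final result, mount the freshly made file system and compare df with the numbers at the top of this post; a minimal sketch, reusing the /mnt mount point from the example:

# format with all of the space saving options combined
mkfs.ext3 -J size=4M -m 0 -O ^resize_inode -N 16 /dev/sdb1
mount /dev/sdb1 /mnt
# Used + Available should now add up to (nearly) the full block count
df /mnt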

1 Comment »

  1. Hi!

    Thank you for the interesting post. I learned much from it.

    I noticed though that there is a small error in the post: you mention the ‘-N’ option that specifies the number of inodes for the filesystem, but then in the example at the end you write ‘-I 16’ instead of ‘-N 16’. Maybe while writing the example you were thinking of another space wasting filesystem feature: the possibility to create large inodes for storing extended attributes. If someone doesn’t use extended attributes, then it is pointless to create the filesystem with large inodes. According to the mkfs.ext3 man page the smallest possible inode size is 128 bytes if you don’t use extended attributes or any other feature that requires the extra inode space.

    Comment by Tamás Kiss — August 3, 2010 @ 3:58 pm
