An enjoyed kernel apprentice Just another WordPress weblog

February 27, 2017

RAID1: avoid unnecessary spin locks in I/O barrier code

Filed under: Block Layer Magic,kernel — colyli @ 11:56 am

When I ran a parallel read performance test on an md raid1 device built from two NVMe SSDs, I was surprised by very poor throughput: with fio at 64KB block size, 40 sequential read I/O jobs and an iodepth of 128, overall throughput was only 2.7GB/s, which is around 50% of the ideal performance number.

perf reports that the lock contention happens in the allow_barrier() and wait_barrier() code,

|        – 41.41%  fio [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
|             – _raw_spin_lock_irqsave
|                         + 89.92% allow_barrier
|                         + 9.34% __wake_up
|        – 37.30%  fio [kernel.kallsyms]  [k] _raw_spin_lock_irq
|              – _raw_spin_lock_irq
|                         – 100.00% wait_barrier

The reason is, in these I/O barrier related functions,

– raise_barrier()
– lower_barrier()
– wait_barrier()
– allow_barrier()

They always take conf->resync_lock first, even when there are only regular read I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, which only takes conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provided comments to improve it. I continued working on it and brought the patch into its current form.

In the new simpler raid1 I/O barrier implementation, there are two wait barrier functions,

  • wait_barrier()

It calls _wait_barrier() and is used for regular write I/O. If there is resync I/O happening on the same I/O barrier bucket, or the whole array is frozen, the task waits until there is no barrier on the same barrier bucket, or until the whole array is unfrozen.

  • wait_read_barrier()

Since regular read I/O won’t interfere with resync I/O (read_balance() makes sure only up-to-date data is read out), it is unnecessary to wait for a barrier in regular read I/Os; waiting is only necessary when the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and conf->barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to avoid unnecessary spin locks in these functions. Once conf->nr_pending[idx] is increased, a resync I/O with the same barrier bucket index has to wait in raise_barrier(). Then in _wait_barrier(), if no barrier is raised on the same barrier bucket index and the array is not frozen, the regular I/O doesn’t need to take conf->resync_lock; it can just increase conf->nr_pending[idx] and return to its caller. wait_read_barrier() is very similar to _wait_barrier(); the only difference is that it only waits when the array is frozen. For heavy parallel read I/Os, the lockless I/O barrier code gets rid of almost all spin lock cost.
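The fast path described above can be illustrated with a small user-space sketch. This is not the kernel code: struct barrier_state, write_fast_path(), read_fast_path() and io_done() are hypothetical names, C11 atomics stand in for the kernel's atomic and memory barrier primitives, and the locked slow path is left out (the sketch simply undoes its increment and reports failure where a real implementation would fall back to conf->resync_lock and wait).

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BUCKETS 1024  /* assumption: bucket count for 4KB pages */

/* Hypothetical stand-in for the barrier related fields of struct r1conf. */
struct barrier_state {
    atomic_int  nr_pending[BUCKETS]; /* in-flight regular I/Os per bucket */
    atomic_int  barrier[BUCKETS];    /* raised resync barriers per bucket */
    atomic_bool array_frozen;        /* whole array frozen */
};

/*
 * Regular write: announce the I/O first, then check for a barrier.
 * Returns true if no lock was needed. On false the caller would have
 * to take the slow path under the lock (omitted in this sketch).
 */
static bool write_fast_path(struct barrier_state *s, int idx)
{
    atomic_fetch_add(&s->nr_pending[idx], 1);
    if (!atomic_load(&s->array_frozen) &&
        atomic_load(&s->barrier[idx]) == 0)
        return true;
    atomic_fetch_sub(&s->nr_pending[idx], 1); /* undo, slow path needed */
    return false;
}

/* Regular read: a raised barrier is ignored, only a frozen array blocks. */
static bool read_fast_path(struct barrier_state *s, int idx)
{
    atomic_fetch_add(&s->nr_pending[idx], 1);
    if (!atomic_load(&s->array_frozen))
        return true;
    atomic_fetch_sub(&s->nr_pending[idx], 1); /* undo, slow path needed */
    return false;
}

/* Counterpart of allow_barrier(): the I/O on this bucket has completed. */
static void io_done(struct barrier_state *s, int idx)
{
    atomic_fetch_sub(&s->nr_pending[idx], 1);
}
```

The ordering matters: the counter is increased before the barrier is checked, so a concurrent resync on the same bucket either sees the pending I/O and waits, or the regular I/O sees the raised barrier and takes the slow path.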

This patch significantly improves raid1 read performance. In my testing, on a raid1 device built from two NVMe SSDs, running fio with 64KB block size, 40 sequential read I/O jobs and an iodepth of 128, overall throughput increases from 2.7GB/s to 4.6GB/s (+70%).

 

Thanks to Shaohua and Neil, who very patiently explained memory barriers and atomic operations to me and helped me compose this patch in a correct way. This patch was merged into Linux v4.11 with commit ID 824e47daddbf.

RAID1: a new I/O barrier implementation to remove resync window

Filed under: Block Layer Magic,kernel — colyli @ 11:42 am

‘Commit 79ef3a8aa1cb (“raid1: Rewrite the implementation of iobarrier.”)’ introduced a sliding resync window for the raid1 I/O barrier. The idea limits I/O barriers to happen only inside a sliding resync window; regular I/Os outside of this resync window don’t need to wait for a barrier any more. On a large raid1 device, it helps a lot to improve parallel write I/O throughput when there are background resync I/Os running at the same time.

The idea of a sliding resync window is awesome, but code complexity is a challenge. The sliding resync window requires several variables to work together, which is complex and very hard to get right. Just grep “Fixes: 79ef3a8aa1” in the kernel git log: there are 8 more patches to fix the original resync window patch. This is not the end, and any further related modification may easily introduce more regressions.

Therefore I decided to implement a much simpler raid1 I/O barrier by removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is,

  • Do not maintain a global unique resync window
  • Use multiple hash buckets to reduce I/O barrier conflicts: regular I/O only has to wait for a resync I/O when both of them have the same barrier bucket index, and vice versa.
  • I/O barrier conflicts can be reduced to an acceptable number if there are enough barrier buckets

Here I explain how the barrier buckets are designed,

  • BARRIER_UNIT_SECTOR_SIZE

The whole LBA address space of a raid1 device is divided into multiple barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.

Bio requests won’t go across the border of a barrier unit, which means the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O, 64MB is large enough for both read and write requests; for sequential I/O, even considering that the underlying block layer may merge them into larger requests, 64MB is still good enough.

Neil Brown also points out that for the resync operation, “we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame”. For a full speed resync, 64MB should take less than 1 second. When resync is competing with other I/O, it could take up to a few minutes. Therefore 64MB is a fairly good range for resync.

  • BARRIER_BUCKETS_NR

There are BARRIER_BUCKETS_NR buckets in total, which is defined by,

#define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
#define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS)

This patch changes the following members of struct r1conf from integers to arrays of integers,

- int     nr_pending;
- int     nr_waiting;
- int     nr_queued;
- int     barrier;
+ int     *nr_pending;
+ int     *nr_waiting;
+ int     *nr_queued;
+ int     *barrier;

The number of array elements is defined as BARRIER_BUCKETS_NR. For a 4KB kernel page size, (PAGE_SHIFT - 2) indicates there are 1024 I/O barrier buckets, and each array of integers occupies a single memory page. With 1024 buckets, a request which is smaller than the I/O barrier unit size has a ~0.1% chance of having to wait for resync to pause, which is a small enough fraction. Also, requesting a single memory page is friendlier to the kernel page allocator than a larger memory size.
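The numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes 4KB pages (PAGE_SHIFT_ASSUMED is a made-up name for the user-space example) and a 4-byte int, which is why subtracting 2 from the page shift makes one array fill exactly one page; bucket_array_bytes() and conflict_chance_percent() are hypothetical helpers.

```c
#include <stddef.h>

#define PAGE_SHIFT_ASSUMED       12  /* assumption: 4KB pages */
#define BARRIER_BUCKETS_NR_BITS  (PAGE_SHIFT_ASSUMED - 2)
#define BARRIER_BUCKETS_NR       (1 << BARRIER_BUCKETS_NR_BITS)

/* Bytes used by one per-bucket array of int (assumes sizeof(int) == 4). */
static size_t bucket_array_bytes(void)
{
    return (size_t)BARRIER_BUCKETS_NR * sizeof(int);
}

/*
 * Chance, in percent, that a request confined to one barrier unit lands
 * in the same bucket as the resync, assuming an even hash distribution.
 */
static double conflict_chance_percent(void)
{
    return 100.0 / BARRIER_BUCKETS_NR;
}
```

With these assumptions there are 1024 buckets, each array takes 4096 bytes, and the conflict chance is 100/1024, just under 0.1%.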

  • I/O barrier bucket is indexed by bio start sector

If multiple I/O requests hit different I/O barrier units, they only need to compete for the I/O barrier with other I/Os which hit the same I/O barrier bucket index. The index of the barrier bucket which a bio should look for is calculated by sector_to_idx(), which is defined in raid1.h as an inline function,

+static inline int sector_to_idx(sector_t sector)
+{
+        return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
+                         BARRIER_BUCKETS_NR_BITS);
+}

Here sector is the start sector number of a bio.
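To see how this mapping behaves, here is a user-space approximation. It is only a sketch: sector_to_idx_sketch() is a hypothetical name, the unit size of 2^17 sectors is derived from the 64MB figure above, 10 bucket bits assume 4KB pages, and the multiplicative hash with GOLDEN_RATIO_64 is my assumption of what hash_long() does on a 64-bit kernel rather than a copy of it.

```c
#include <stdint.h>

#define BARRIER_UNIT_SECTOR_BITS 17  /* 2^17 sectors * 512 bytes = 64MB */
#define BARRIER_BUCKETS_NR_BITS  10  /* assumption: 4KB pages */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull /* assumed hash multiplier */

/* Map a bio start sector to a barrier bucket index in [0, 1024). */
static inline int sector_to_idx_sketch(uint64_t sector)
{
    /* All sectors inside one 64MB barrier unit share the same unit number. */
    uint64_t unit = sector >> BARRIER_UNIT_SECTOR_BITS;

    /* Multiplicative hash: keep the top BARRIER_BUCKETS_NR_BITS bits. */
    return (int)((unit * GOLDEN_RATIO_64) >> (64 - BARRIER_BUCKETS_NR_BITS));
}
```

Every sector of one barrier unit maps to the same bucket, while neighbouring units are scattered over the buckets, so a linearly growing resync does not keep hitting the bucket next to the one it just left.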

  • A single bio won’t go across the boundary of an I/O barrier unit

If a request goes across the boundary of a barrier unit, it will be split. A bio may be split in raid1_make_request() or raid1_sync_request() if the number of sectors returned by align_to_barrier_unit_end() is smaller than the original bio size.
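The split point can be computed with simple arithmetic. The following is a minimal user-space sketch of the idea, not the kernel's align_to_barrier_unit_end(): sectors_until_unit_end() is a hypothetical name, and the unit size of 2^17 sectors is derived from the 64MB figure above.

```c
#include <stdint.h>

#define BARRIER_UNIT_SECTOR_BITS 17
#define BARRIER_UNIT_SECTOR_SIZE (1ull << BARRIER_UNIT_SECTOR_BITS)

/*
 * Return how many of the requested sectors fit between start and the end
 * of the barrier unit that start belongs to. A result smaller than
 * sectors means the request crosses a unit boundary and must be split.
 */
static uint64_t sectors_until_unit_end(uint64_t start, uint64_t sectors)
{
    /* First sector of the next barrier unit. */
    uint64_t unit_end = (start | (BARRIER_UNIT_SECTOR_SIZE - 1)) + 1;
    uint64_t len = unit_end - start;

    return len < sectors ? len : sectors;
}
```

For example, a 10-sector request that starts 4 sectors before a unit boundary gets 4 sectors in the first part, and the remaining 6 sectors go into a second bio that starts at the next barrier unit.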

Compared to the single sliding resync window,

  • Currently resync I/O grows linearly, therefore regular and resync I/O will conflict within a single barrier unit. So the I/O behavior is similar to the single sliding resync window.
  • But a barrier bucket is shared by all barrier units with an identical barrier bucket index, so the probability of conflict might be higher than with the single sliding resync window, in the case that write I/Os always hit barrier units which have the same barrier bucket index as the resync I/Os. This is a very rare condition in real I/O workloads; I cannot imagine how it could happen in practice.
  • Therefore we can achieve a sufficiently low conflict rate with a much simpler barrier algorithm and implementation.

 

Many thanks to Shaohua and Neil, who reviewed the code, pointed out many bugs, and provided very useful suggestions. Finally we made it: this patch was merged in Linux v4.11 with commit ID fd76863e37fe.
