Partition alignment of drives with internal sector size larger than 512 bytes

  • 7007193
  • 11-Nov-2010
  • 30-Apr-2012

Environment

SUSE Linux Enterprise Server 11 Service Pack 1
SUSE Linux Enterprise Desktop 11 Service Pack 1
openSUSE 11.3

Situation

There are storage drives -- hard disk drives (HDDs) and solid state disks (SSDs) -- available today that maintain a classic 512 byte sector size at the interface level but work with larger block sizes internally. This article describes how to set up the SUSE Linux Enterprise operating system to achieve the best performance on them. It applies in a similar way to openSUSE and other Linux distributions. The article is NOT targeted towards drives that expose sector sizes larger than 512 bytes at the interface level, as some recent external drives with more than 2TB of space do. Those disks are used with their native sector size and no adjustment is required.

The issue

The drives achieve their best performance when accesses are aligned with the internal block size. As the Linux kernel typically issues accesses in multiples of the hardware page size (4k on x86), an unaligned read often causes one more internal block to be read than an aligned access would. Worse, on writes that only cover part of an internal block, the drive might need to do an expensive read-modify-write (RMW) cycle rather than just a write. So on a rotating drive with 4k internal block size, a single unaligned 4k write may incur a penalty of about 11ms on a 5400rpm HDD, which is the time of one full revolution.
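The two numbers above can be checked with a little arithmetic. The following is a minimal sketch, assuming 512 byte logical sectors and a 4k internal block; the function names are made up for this illustration.

```python
SECTOR = 512  # logical sector size in bytes (assumed)

def revolution_ms(rpm):
    """Time for one full platter revolution in milliseconds."""
    return 60000.0 / rpm

def blocks_touched(start_lba, length_bytes, block_bytes=4096):
    """Number of internal blocks covered by an access starting at start_lba."""
    first = (start_lba * SECTOR) // block_bytes
    last = (start_lba * SECTOR + length_bytes - 1) // block_bytes
    return last - first + 1

if __name__ == "__main__":
    print("one revolution at 5400rpm: %.1f ms" % revolution_ms(5400))
    print("4k access at LBA 64 touches", blocks_touched(64, 4096), "block(s)")
    print("4k access at LBA 63 touches", blocks_touched(63, 4096), "block(s)")
```

An aligned 4k access stays within one internal block, while the same access shifted to LBA 63 straddles two.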

Note: Some large storage arrays (SAN) use a 4k block size internally, too, without necessarily reporting it to the OS. So they will profit from partition alignment as well.

Partition alignment

The classical DOS partition alignment is unfortunate. The classical C/H/S = X/255/63 pseudo geometry translation scheme, combined with the convention to start the first partition at C/H/S = 0/1/1 (note that sector counting traditionally starts with one for reasons that complaining about would be beyond the scope of this article), translates to a linear (LBA) offset of 63 -- which is misaligned with anything that's used anywhere and larger than 512 bytes.
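The offset of 63 can be reproduced with the usual CHS-to-LBA conversion. The sketch below counts cylinders and heads from zero and sectors from one, which is an assumption of this sketch, so the classical start of the first partition is written as 0/1/1 here. The function name is hypothetical.

```python
def chs_to_lba(c, h, s, heads=255, sectors=63):
    """Convert a C/H/S address to a linear sector number (LBA).

    Cylinders and heads count from 0, sectors count from 1.
    """
    return (c * heads + h) * sectors + (s - 1)

if __name__ == "__main__":
    lba = chs_to_lba(0, 1, 1)
    print("LBA:", lba)
    print("byte offset:", lba * 512)
    print("4k aligned:", lba % 8 == 0)
```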

The solution suggested in this article is to have the partitions start at aligned addresses. One way to achieve that is to use a different CHS scheme; using e.g. C/H/S = Y/240/56 would result in a 64k (128 sector) alignment for primary partitions -- except that the first primary partition would only be 4k aligned at offset LBA 56. As this ensures a good alignment neither for the first primary partition nor for the logical partitions in the extended partition, the description here sticks with the classical CHS translation. Instead, it uses the fact that partitions don't need to start on cylinder boundaries but can be moved to start at the next aligned address.
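The alignment that a given pseudo geometry provides can be derived from the number of sectors per cylinder and per track. The following sketch (function names are made up for this illustration) computes the largest power of two that divides each of them.

```python
def pow2_alignment(sectors):
    """Largest power of two (in sectors) that divides a positive sector count."""
    return sectors & -sectors

def geometry_alignment(heads, sectors_per_track):
    """Return (cylinder alignment, first-track alignment), both in sectors."""
    per_cylinder = heads * sectors_per_track
    return pow2_alignment(per_cylinder), pow2_alignment(sectors_per_track)

if __name__ == "__main__":
    for heads, spt in ((255, 63), (240, 56)):
        cyl, track = geometry_alignment(heads, spt)
        print("%d/%d: cylinders aligned to %d sector(s), first track to %d sector(s)"
              % (heads, spt, cyl, track))
```

With 240/56 a cylinder has 13440 sectors, a multiple of 128, while the first track boundary at sector 56 is only a multiple of 8. With 255/63 both numbers are odd, so nothing beyond single sectors is aligned.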

Sidenote: This issue can NOT be avoided by using other, non-DOS partition table formats, like the GUID Partition Table (GPT). Even when using a GPT partitioning scheme you need to ensure that the partition is aligned properly. The benefit GPT gives you here is that you can use disks larger than 2TB, and as basically all of them are using 4k block sizes internally you would want to follow the guidelines stated in this document. Please be aware that many other OSes cannot access GPT partitions and most BIOSes can't boot from a GPT disk, so check the compatibility before using it.

What alignment?

Before doing the work, you need to decide what alignment to use. If the internal block size is known (as for disks with internal 4k sectors), that size can be chosen. For SSDs, it is generally not known.

But the friends from Redmond provide guidance here -- as Windows 7 by default uses 1M partition alignment, it is safe to assume that most drives will be optimized to provide good performance with such alignment. So in case of doubt, it will never hurt to align partition starts to 1M boundaries.

There's one special case: When internal 4k block sizes were introduced, some HDD manufacturers actually addressed the classical DOS partition table misalignment by shifting the logical sector counting by one, so a start at sector 63 would translate to sector 64 (i.e. internal block 8). Some HDDs were even configurable with a switch to do this shifting by one. The SATA spec even provides a mechanism for the drives to report such an offset, so the OS can take the appropriate steps to optimize performance. To our knowledge not many such drives exist, and only a subset of them report the offset correctly.

If the drive reports an alignment offset, the Linux kernel in SLE11-SP1 (or later) exposes it via the attribute /sys/block/$DEV/alignment_offset (in bytes).

Some drives will report their internal block size via /sys/block/$DEV/queue/physical_block_size, though the SSDs tested all reported 512 (bytes) there and do not report the internal erase block size, which almost certainly is larger.
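The two sysfs attributes mentioned above can be read with a few lines of Python. This is only a sketch: the function names and the sysfs_root parameter are made up for this illustration, and the values are printed exactly as the kernel reports them.

```python
import os

def read_int(path, default=None):
    """Return the integer stored in a sysfs attribute, or default if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default

def alignment_info(dev, sysfs_root="/sys/block"):
    """Collect the alignment related attributes of a block device such as 'sdb'."""
    base = os.path.join(sysfs_root, dev)
    return {
        "alignment_offset": read_int(os.path.join(base, "alignment_offset")),
        "physical_block_size": read_int(
            os.path.join(base, "queue", "physical_block_size")),
    }

if __name__ == "__main__":
    # "sdb" is just an example device name.
    for key, value in sorted(alignment_info("sdb").items()):
        print("%s: %s" % (key, value))
```

A value of None means the attribute could not be read, e.g. because the device does not exist or the kernel is too old to provide it.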

In summary, going with 1M (2048 sectors) alignment is still a good default choice -- unless we know about an alignment_offset. For convenience, there is a little Python script that can be used to calculate recommended partition offsets at http://www.suse.de/~garloff/align_partition.py
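The calculation itself is a simple round-up. The following is a minimal sketch of it and is NOT the align_partition.py script referenced above; the function name is made up, all values are in 512 byte sectors, and a drive that reports a non-zero alignment_offset is not handled.

```python
def align_up(sector, alignment=2048):
    """Round a start sector UP to the next multiple of alignment.

    Both values are in 512 byte sectors; 2048 sectors correspond to 1M,
    8 sectors to 4k.
    """
    return ((sector + alignment - 1) // alignment) * alignment

if __name__ == "__main__":
    print("1M alignment for old start 63:", align_up(63))
    print("4k alignment for old start 63:", align_up(63, 8))
    print("4k alignment for old start 16065:", align_up(16065, 8))
```

A start that is already aligned is returned unchanged.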

Moving partitions is dangerous

You need to do the partition alignment BEFORE creating a filesystem on the partition, as moving the beginning of a partition will render existing filesystems inaccessible. Let me repeat that: DOING PARTITION ALIGNMENT ON PARTITIONS WITH EXISTING FILESYSTEMS WILL CAUSE A LOSS OF DATA. So make sure you have working backups (if there is anything to back up).

This means you should follow the steps described below before using a disk -- when you want to do an installation to the disk, the recommendation is to first boot into a rescue system, do the partitioning, and then reboot to start the real install process, using the partitions unchanged and just putting filesystems on them. (There is an option to change to the text console when running an installation via YaST and do things there -- but you'd need to make sure to have YaST reread the partition table eventually.)

For secondary disks this is obviously easier -- you just do the steps out of a running system.

Note that if ANY partitions on the disk whose partition table you modify are in use, e.g. because they contain mounted filesystems, your modifications will only become visible upon reboot. In that case, please don't run mkfs or similar on the changed partitions before rebooting.

Please also note that on some hard disks it is possible to use a jumper which internally moves all of the logical 512 byte sectors by one. You need to make sure that this jumper is not set!

Moving partitions with fdisk

The following step-by-step instructions describe how to interactively move the beginning of partitions using the expert mode of fdisk. There are other ways (using e.g. parted) that are not covered here.

You need to have write access to the raw disk device (e.g. /dev/sdb) to do the following steps -- typically this means you need to be root. If starting with a vanilla disk, first create partitions of the size that you like. This can be done using fdisk, parted or more user-friendly tools such as the YaST partitioner.

WHEN CHANGING PARTITIONS, IT'S HIGHLY RECOMMENDED THAT YOU DOUBLE CHECK YOU ARE WORKING ON THE INTENDED DISK BEFORE DOING ANYTHING; THE RISK OF LOSING DATA IS VERY HIGH OTHERWISE. Careful people always have a paper hardcopy of their partition tables created by e.g. fdisk -l | lpr so they can recover from such mistakes. Another way if you exclusively have primary partitions is to save the first 512 bytes from your hard disk (containing the master boot record and the partition table) to a file using dd or dd_rescue.
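Saving the first 512 bytes is normally a one-liner with dd. For illustration, here is a minimal Python sketch of the same idea; the function name and the file names in the comment are made up, and reading the raw device requires root.

```python
def backup_mbr(device, backup_file):
    """Copy the first 512 bytes of device to backup_file and return them."""
    with open(device, "rb") as src:
        mbr = src.read(512)
    if len(mbr) != 512:
        raise IOError("short read from %s" % device)
    with open(backup_file, "wb") as dst:
        dst.write(mbr)
    return mbr

# Example call (as root, device and file name are placeholders):
#   backup_mbr("/dev/sdb", "sdb-mbr.bin")
```

As stated above, such a copy only covers the primary partitions; logical partitions inside an extended partition are described elsewhere on the disk.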

Now, let's move the beginning of the partitions to be well aligned. Let's assume your hard disk is called sdb.

  1. Start fdisk by calling fdisk /dev/sdb
  2. Print the partition table just to be sure you are looking at the right disk: p 
  3. Go to expert mode: x 
  4. Use the b command to move the beginning of a partition: b
  5. Choose the partition you want to modify: NUMBER 
  6. fdisk will prompt you for the NEW offset and will have a default proposal that corresponds to the OLD offset. NOTE: These offsets are in units of logical sectors (512 bytes)
  7. Calculate the new offset by rounding the offset UP to the next number that fits your alignment desire, e.g. the next larger multiple of 8 if you want to achieve 4k alignment. (You can use the align_partition.py script to do the math for you.)
  8. Enter the new offset and press enter
  9. Repeat for all partitions (go back to step 4)
  10. When done leave expert mode: r 
  11. Review the partition table: p It's not yet on the disk, so if you screwed up, now is the time to abort with: q 
  12. When satisfied, write the changes to disk: w This will also leave fdisk.

On the last step please watch for messages of the kernel failing to read your new partition table. This would mean that some partition of your disk is in use and that you'd need to reboot to fix up. However, if this happens, there is a chance that you have actually screwed up and modified the wrong disk :-( Now's the time to use the "fdisk -l" printout and restore the old partition table manually ...

If everything went well, you can now start creating filesystems using mkfs (or mkswap for swap space) or your favorite GUI tool on the partitions or continue with LVM2 setup.
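Before putting filesystems on the partitions, the start sectors that were actually written to the disk can be double-checked. The sketch below parses the four primary entries of a DOS partition table from the first 512 bytes of the disk. The function names are made up for this illustration, and logical partitions inside an extended partition are not examined.

```python
import struct

def primary_partitions(mbr):
    """Yield (number, type, start_lba, sector_count) for used primary entries."""
    if len(mbr) < 512 or bytes(mbr[510:512]) != b"\x55\xaa":
        raise ValueError("no valid DOS partition table signature")
    for i in range(4):
        # Each entry is 16 bytes; the table starts at byte 446.
        entry = bytes(mbr[446 + 16 * i: 446 + 16 * (i + 1)])
        ptype = entry[4]
        start, count = struct.unpack("<II", entry[8:16])
        if ptype != 0 and count != 0:
            yield i + 1, ptype, start, count

def misaligned(mbr, alignment=2048):
    """Return (number, start_lba) of primary partitions not aligned to alignment sectors."""
    return [(num, start)
            for num, _ptype, start, _count in primary_partitions(mbr)
            if start % alignment != 0]

# Example call (as root, device name is a placeholder):
#   with open("/dev/sdb", "rb") as disk:
#       print(misaligned(disk.read(512), alignment=8))
```

An empty list means that all primary partitions start on a multiple of the chosen alignment.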

Additional hints

In our experience it's not worth fiddling with the stride= parameter in ext2/3/4 for 4k drives or SSDs. If you set up a RAID system on top of aligned partitions, it helps to use chunk sizes that are multiples of the internal block size -- though with a default of 64k in mdadm, this typically does not need any intervention.

For SSDs, using the deadline or noop IO scheduler tends to provide a minor increase in performance over CFQ -- though the latter detects the fact that SSDs are non-rotational devices (in SLE11SP1 or later) and optimizes rather well for that case, too. So it's a matter of trading minor performance gains against the ability to do some QoS with CFQ.

You might also achieve minor gains by reducing the readahead size for SSDs -- though putting it down to very small values will hurt your linear (streaming) read performance a bit there as well.
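The current scheduler, the rotational flag and the readahead size of a device can be inspected via sysfs. The sketch below assumes the attributes scheduler, rotational and read_ahead_kb under /sys/block/$DEV/queue, which are not named in this article; the function names and the sysfs_root parameter are made up for this illustration.

```python
import os

def active_scheduler(text):
    """Return the scheduler marked with [brackets] in the sysfs scheduler line."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

def queue_settings(dev, sysfs_root="/sys/block"):
    """Read scheduler, rotational flag and readahead size of a block device."""
    queue = os.path.join(sysfs_root, dev, "queue")

    def read(name):
        try:
            with open(os.path.join(queue, name)) as f:
                return f.read().strip()
        except OSError:
            return None

    sched = read("scheduler")
    return {
        "scheduler": active_scheduler(sched) if sched else None,
        "rotational": read("rotational"),
        "read_ahead_kb": read("read_ahead_kb"),
    }

if __name__ == "__main__":
    print(active_scheduler("noop deadline [cfq]"))
```

Changing these values means writing to the same attributes as root, which is not shown here.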

For SSDs, one thing that has been observed especially with the first generation of drives is that their write performance drops dramatically as soon as the drives run out of empty erase blocks, which happens after using them for a while.

More modern drives address this by recycling unused (zeroed-out) space automatically and allowing the OS to tell the SSD about unused blocks using the TRIM command. SLE11 SP1 ships with wiper.sh, which sends down appropriate TRIM commands after analyzing a filesystem. Note that this should only be used after having done a backup. It also has certain limitations, e.g. it does not support LVM or RAID, does not support some filesystems at all (btrfs), and only supports offline or read-only trimming for others.

Using SSDs to put your root filesystem on (and following the instructions in this article) is possibly the most efficient investment into improving the interactive experience using your system. Boot times and response times of the system tend to experience huge improvements. You can also put the swap partition on an SSD, which will result in making a swapping system usable for much longer -- though the access patterns of swapping seem to make the first gen SSDs degrade rather quickly in write performance.

Mounting filesystems with the relatime mount option -- or better, with noatime, if you can validate that it does not hurt your use case -- is very beneficial. This applies in a rather general way though, not only to 4k HDDs or SSDs. relatime is used by default in SLE11 SP1.
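To see which of these options the mounted filesystems currently use, the option column of /proc/mounts can be inspected. A minimal sketch follows; the function names are made up for this illustration, and only the options relatime and noatime are looked for.

```python
def atime_mode(options):
    """Return 'noatime' or 'relatime' if present in a comma-separated option string."""
    opts = options.split(",")
    for mode in ("noatime", "relatime"):
        if mode in opts:
            return mode
    return None

def mounts_by_atime(lines):
    """Return (mount point, atime mode) pairs for lines in /proc/mounts format."""
    result = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        result.append((fields[1], atime_mode(fields[3])))
    return result

# Example call:
#   with open("/proc/mounts") as f:
#       for mount_point, mode in mounts_by_atime(f):
#           print(mount_point, mode)
```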