GroupWise and SAN Design, Part 1
Novell Cool Solutions: AppNote
By Tim Heywood
Posted: 2 Oct 2007
Over the years, both in the forums and in the course of my day job, I have been involved with GroupWise systems that are intended to be hosted on clusters or centralized storage, replicated, and otherwise used and abused. A single recurring theme has stood out on all of these occasions: folk wish to do unto GroupWise that which it is not "wise" so to do ...
Through this AppNote and those that follow, I want to walk through the thinking and process that is required for a GroupWise system on centralized storage.
- Disk layout and LUN placement within an array, how to ensure adequate performance from the design of the arrays, and how to avoid the pitfalls that damage performance.
- What to do to replicate GroupWise data, both for business continuity as well as for disaster recovery - and how the two differ.
- Techniques to use for backup and reducing the Post Office size, either before a move to central storage or otherwise.
- Thoughts on storage in the virtualized world.
Following the articles on the storage, we'll look at the GroupWise deployment on a NetWare Cluster, a Linux Cluster, and then - the pièce de résistance - deploying GroupWise in a BCC.
Just Data Storage
This section deals with the deployment of a central storage solution to host GroupWise or other data, and how that is affected by today's technologies and deployment methodologies. In this section, the comments are as valid for a single server accessing the data as for a cluster node doing the same. While this AppNote is still focused on NetWare and the direct effect that storage design can have upon that Network Operating System, it is just as valid for any modern operating system, whether openly acknowledged by the vendor or otherwise.
When optimizing a GroupWise system on a traditional design with local storage, we try to avoid having more than one Post Office on a single server. Why? Because having multiple Post Offices on a single server leads to competition for the same resources. Now if that is true on a single server, then it should also hold true when the resource in question is the storage solution.
Looking at a traditional 5-disk RAID on a simple server, it would appear something like this:
Figure 1 - 5-disk RAID system
In this case, the whole of this array has been configured as a single disk, or LUN. A Logical Unit Number (LUN) is an address for an individual disk drive and, by extension, the disk device itself. The term is used in the SCSI protocol as a way to differentiate individual disk drives within a common SCSI target device, such as a disk array. The term has become common in storage area networks (SAN) and other enterprise storage fields. Today, LUNs are normally not entire disk drives but rather virtual partitions (or volumes) of a RAID set.
If this same storage of five disks were used not by a single GroupWise Post Office but by two, then there would be contention. The traditional way would be to look at the storage and break the array into 2 partitions, say half each, and have a dedicated partition for each GroupWise Post Office:
Figure 2 - Partitioned storage system
By dividing the array into two parts, we reduce the fragmentation within the respective file systems for each of the Post Offices and therefore speed the net performance. By having a dedicated LUN on an Array - the LUN being the virtual disk as presented by a storage array - each of the two GroupWise systems is self-contained. Any fragmentation is internal to the file system of the one Post Office, thus containing any performance issues. Likewise, it would not be recommended to use a GroupWise volume on a NetWare server in a shared NSS Pool with a second volume of any type, as the fragmentation issues would be terrible.
Let's take this to its logical conclusion, with a SAN Array Disk Tray holding some 15 drives (14, 15, and 16 all seem to be common tray sizes) and two GroupWise Post Offices. We would see a disk system that looks something like this:
Figure 3 - SAN tray with 15 drives
Each of the two LUNs presented has half of the disk capacity spread over the 15 drives. The read performance is optimized by having as many spindles as possible available.
Sounds perfect - but wait just one moment before we set this up. There are three issues with this design.
First issue: Disk capacity. If the standard SAN Array disk today is a 145GB 15K rpm disk, this RAID (with double parity) has a minimum capacity of 13 x 145GB, or nearly 1.9TB. That's a little large for just 2 Post Offices.
Second issue: We have a problem with NetWare and optimization of the disk subsystem that has worked so well for so many years. This problem can be described as "Elevator-Seeking" (see the description below).
Third issue: Recovery time. With the introduction of larger and larger hard drives, storage administrators have been presented with an issue: increased recovery times. In the past, 18GB drives were the standard size, and even if you had five drives in a RAID 5 array, the total storage of the array was only about 70GB. At that size, recovery from a failed drive would take only a few hours. With those five drives now closer to the 1/2 TB mark per drive, they create a 2TB array. This larger array could take up to a day to recover (depending on system load, disk speed, etc.). This can be mitigated by using RAID6 or double-parity systems (also known as ADG), by systems like RAID5EE, or by having "Hot Spare" disks available in the tray, either distributed or physical.
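The capacity figures in these three issues are easy to check. Here is a minimal Python sketch, assuming a simple whole-disk model in which parity and hot-spare disks contribute no usable space:

```python
def raid_usable_gb(disks, disk_gb, parity_disks=1, hot_spares=0):
    """Usable capacity of a RAID set: data disks times per-disk size.

    Simple whole-disk model: parity and hot-spare disks give no usable space.
    """
    data_disks = disks - parity_disks - hot_spares
    if data_disks < 1:
        raise ValueError("not enough disks left for data")
    return data_disks * disk_gb

# First issue: a 15-disk tray of 145GB drives with double parity (RAID6/ADG).
print(raid_usable_gb(15, 145, parity_disks=2))  # 1885 - nearly 1.9TB

# Third issue: five drives near the 1/2 TB mark in RAID5 make a ~2TB array.
print(raid_usable_gb(5, 500))                   # 2000
```

The same helper shows the cost of a hot spare: a 5-disk RAID5 of 145GB drives drops from 580GB to 435GB usable when one disk is reserved as a spare.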
Elevator Seeking is a process where the hard disk read-write head picks up data in the direction it is travelling across the disk, rather than in the order data is requested. In this way, disk I/O requests are organized logically according to disk head position as they arrive at the server for processing. This reduces back-and-forth movements of the disk head and minimizes head seek times. All disks used today still operate under the FIFO arrangement (First In, First Out), and by using Elevator Seeking, NetWare manages this process in a much more effective manner. This is the reason why NetWare servers rarely benefit from a regular de-fragmentation process, though in some cases de-fragmentation can improve performance and there are tools for it (such as http://www.novell.com/coolsolutions/tools/13899.html). The simple question "Have you ever heard a death rattle from a NetWare server?" has for years brought understanding as to why elevator seeking is such a tremendous advantage enjoyed by NetWare servers.
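The benefit is easy to demonstrate with simple arithmetic. The following Python sketch (the track numbers and request stream are invented for illustration) compares total head travel for FIFO servicing against an elevator-style sweep:

```python
def head_travel(start, tracks):
    """Total head movement when servicing track requests in the given order."""
    travel, pos = 0, start
    for t in tracks:
        travel += abs(t - pos)
        pos = t
    return travel

def elevator_order(start, tracks):
    """One-direction sweep: service tracks at or above the head going up,
    then the remainder coming back down (classic SCAN-style ordering)."""
    up = sorted(t for t in tracks if t >= start)
    down = sorted((t for t in tracks if t < start), reverse=True)
    return up + down

requests = [98, 183, 37, 122, 14, 124, 65, 67]   # arrival (FIFO) order
start = 53

print(head_travel(start, requests))                         # 640 tracks of travel
print(head_travel(start, elevator_order(start, requests)))  # 299 tracks of travel
```

Servicing the same eight requests in sweep order rather than arrival order cuts head travel by more than half - that is the whole trick behind elevator seeking.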
LUN Sizes and Arrangements
With the capacity of modern disks so high, the LUN sizes created are very large. They are so large that a tray with 15 disks would have demands placed upon it that would not allow for the disks to be dedicated to 2 LUNs, let alone a single LUN across so many disks - and heaven forbid that the array was installed with 300GB disks! So let's look at what we would have in an untutored, real-world scenario.
Figure 4 - Array with 6 LUNs
In the scenario above, we have just 6 LUNs spread across the array, each now holding a GroupWise Post Office. The good news is that each Post Office now has its own dedicated LUN that is but 300GB in size, making efficient use of the space. The bad news is that a LUN is presented to an operating system as a disk.
Remember the definition of Elevator Seeking? " ... a process where the hard disk read-write head picks up data in the direction it is travelling across the disk, rather than in the order data is requested." So now we have an array that is presenting the storage as 6 virtual disks (LUNs), and Elevator Seeking is going to stack the requests for each. The problem is that the seek requests are ordered per LUN, yet there are multiple virtual disks within each physical disk, so the efficiency is lost. The death rattle has arrived, and the performance of the system has been seriously compromised.
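A rough illustration of why this hurts, again with invented track numbers: each LUN occupies its own band of physical tracks, so even when every LUN's queue is perfectly elevator-ordered, the interleaved streams force the head to hop between distant bands:

```python
def head_travel(start, tracks):
    """Total head movement when servicing track requests in the given order."""
    travel, pos = 0, start
    for t in tracks:
        travel += abs(t - pos)
        pos = t
    return travel

# Six LUNs carved from one physical disk: each LUN sits in its own band of
# physical tracks (LUN 0 -> tracks 0-999, LUN 1 -> 1000-1999, and so on).
luns = [[base + offset for offset in (120, 480, 730)]
        for base in range(0, 6000, 1000)]

# Each LUN's queue is already elevator-ordered, but the physical disk sees
# the six streams interleaved - so the head hops between distant bands.
interleaved = [t for batch in zip(*luns) for t in batch]

# With a single LUN over the whole disk, one sweep covers every request.
single_sweep = sorted(t for lun in luns for t in lun)

print(head_travel(0, interleaved))   # 24510 tracks of travel
print(head_travel(0, single_sweep))  # 5730 tracks of travel
```

The numbers are contrived, but the shape of the result is not: per-LUN ordering on shared spindles multiplies head travel compared with one sweep over the physical disk.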
Now some might say that while this has a theoretical impact, there is no real-world scenario where it will have a significant impact upon performance. There are two answers here: GroupWise and Scheduled Events. GroupWise, as we all know, is hugely IO intensive. Therefore, any design that compromises that IO has to be a bad thing.
Figure 5 - GroupWise Post Offices and shared storage
Looking at the typical deployment of our 6 Post Offices, we can see that the resources are not shared - each Post Office is hosted on a dedicated server and has a dedicated LUN in the shared storage. We have been through the bad news of the LUNs shared on the same physical array. But what happens if, as would be expected, all of the Post Offices start their scheduled events (say, a contents check) at the same time? The impact on performance would be considerable and would affect all of the Post Offices equally. All of the money that was invested in the storage array to improve availability and performance will provide the resilience but miss the mark on the speed element.
Rather than splitting all of the LUNs across all of the disks, we could break down the arrays into smaller "chunks" of storage. Then the contention for IO on each disk would be reduced, and the net positive effect of the Elevator Seeking would improve. In the figure below, the tray has been split into three arrays of 5 disks each, with 2 Post Offices per array.
Figure 6 - Tray with three arrays, 2 Post Offices per array
The downside of this is that the storage now has only one parity disk per array, as opposed to the double parity that was used in the whole-tray scenario. Further, the total storage available has been reduced by the value of one more disk. There is a side advantage that then becomes apparent: rebuild times would be significantly lower. In a real-world scenario, there would be the additional issue that the provision of a hot spare disk per tray should be considered. This will impact the performance still further by reducing the spindle count available in each array.
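The trade-off described above can be put in whole-disk numbers; a minimal sketch, counting only data disks:

```python
def usable_disks(arrays):
    """Sum of data disks across a set of arrays.

    arrays is a list of (disks_in_array, parity_disks) tuples.
    """
    return sum(disks - parity for disks, parity in arrays)

tray_as_raid6 = usable_disks([(15, 2)])        # whole tray, double parity
tray_as_3x_raid5 = usable_disks([(5, 1)] * 3)  # three 5-disk RAID5 arrays

print(tray_as_raid6, tray_as_3x_raid5)  # 13 12
print(tray_as_3x_raid5 - 1)             # 11 once a hot spare is reserved
```

So splitting the tray costs one disk of capacity (two once a hot spare is added), in exchange for reduced IO contention and much shorter rebuilds.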
In this AppNote I cannot tell everyone exactly how they should lay out their disks and implement their arrays for their environments. But if I have made you think about some of the criteria that you should consider in the disk layout, then I have achieved my aim. Some of the more interesting designs that I have had the pleasure of being involved with include the following:
Figure 7 - Array across multiple disk trays, single disk in each
In this case, the array was across multiple disk trays with a single disk in each. This protects the array from the loss of a tray and yet provides a dedicated array for each LUN avoiding any disk IO contention. The LUNs are large (for large Post Offices), and each Post Office has its restore area and software folders in attendance.
Another site had smaller disks (36GB) but had a major performance problem, with 14 LUNs spread across a single array of 14 disks. In this case, a pair of drives in each tray was linked as a RAID1 (mirrored pair), and these 5 pairs were then striped as a single 5-disk-wide RAID0.
Figure 8 - RAID1 mirrored pair (RAID10)
In this case, because the mirrored pair provides the resilience, the RAID0 (stripe) is not quite the liability it could otherwise be. This can be considered a RAID10. Although many would consider a RAID10 to be two large striped arrays mirrored, that should be referred to as RAID 0+1.
The downside of a RAID10 is that the net usable space has been reduced to only 50% of the raw disks installed.
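The 50% figure, and the reason the RAID10 vs RAID 0+1 naming matters, can both be sketched with simple arithmetic (a simplified model that ignores rebuild windows):

```python
from fractions import Fraction

def raid10_usable_gb(pairs, disk_gb):
    """RAID10: each mirrored pair contributes one disk's worth of space."""
    return pairs * disk_gb

# Five mirrored pairs of 36GB disks striped together: 180GB of 360GB raw.
print(raid10_usable_gb(5, 36))

# After one disk fails, which of the 9 survivors would lose data if it
# failed next? In RAID10 only the dead disk's mirror partner is fatal;
# in RAID 0+1 the whole stripe is already gone, so any disk in the
# surviving stripe is fatal.
raid10_fatal = Fraction(1, 9)
raid01_fatal = Fraction(5, 9)
print(raid10_fatal, raid01_fatal)  # 1/9 5/9
```

Both layouts give the same 50% usable space, but RAID10 tolerates a second failure far better - which is why the distinction is more than pedantry.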
This last case is a development of the RAID10 solution with the change of design for the mirrored pairs.
Figure 9 - Different design for RAID10 solution
Instead of having both disks of a pair within the same tray, here they are hosted in different trays. The advantages of performance and basic resilience remain, but the chance of a catastrophic failure caused by the loss of a tray - which would take out a complete mirrored pair - has been substantially reduced.
Figure 10 - Tray configuration
In all of these cases, there is one final criterion to be considered: Does your chosen array support the design that you wish to employ? Most SAN Arrays today will support the creation of an array that is hosted on one of the two internal controllers and the mirror hosted on the second. However, there are some that do not and therefore compromise the number of possible deployment scenarios. An example would be two RAID5s mirrored across the two controllers (a RAID 51). Equally, there are systems that internally manage the disk arrays and the storage placement without manual intervention. These not only have the issue of multiple LUNs hosted on the same disk, but they also have no way of mitigating the impact by limiting which LUNs are competing for which disk.
In this AppNote, we have examined some of the factors that can degrade the performance of a GroupWise Post Office when hosted upon a shared storage system - sadly, an all-too-common occurrence. However, when properly deployed and presented, a shared storage solution can seriously improve the net performance of a GroupWise system. Without such a deployment, clustering and BCC would not be possible.
In the next article we'll look at how GroupWise affects the use of mirroring technologies for DR and Business Continuity solutions.
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com