Because e-mail, scheduling, address books, and calendars are mission-critical applications, system reliability is one of the most important aspects of any messaging system. Indeed, 99.99% uptime requirements are not uncommon. System fault tolerance is achieved when there is no single point of failure in the system; that is, if any one server fails, mail users are unaffected.
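To put an uptime requirement in concrete terms, it can help to convert the percentage into a downtime budget. The following is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions made for illustration and are not part of NetMail.

```python
def allowed_downtime_minutes(uptime_percent: float, days: float = 365.0) -> float:
    """Return the minutes of downtime permitted over `days` at the given uptime."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100.0)


# A 99.99% requirement leaves a budget of roughly 52.6 minutes per year.
budget = allowed_downtime_minutes(99.99)
print(f"{budget:.1f} minutes per year")
```

A budget this small leaves little room for manual recovery, which is why the rest of this section concentrates on removing single points of failure.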
NetMail allows you to implement redundancy and failover support at two levels: the application level and the hardware level.
Application-level clustering consists of duplicating mail services on multiple servers. Because of the NetMail system's highly modular architecture and eDirectory replication, critical services can run simultaneously on multiple servers and provide the same service. Consequently, you can provide fault tolerance for most mail services at the application level.
In comparison to hardware-level clustering, application-level clustering is relatively inexpensive. It is built into NetMail and does not require specialized hardware. In fact, servers in a NetMail application cluster do not even need to run the same operating system. As an added benefit, application clustering automatically provides load balancing because all servers in a NetMail application cluster can be active at all times. Consequently, application-level clustering is the first choice in building system fault tolerance.
NetMail client agents (POP, IMAP, Modular Web) provide the same service, regardless of which server they run on. Therefore, you can configure these agents to run on as many servers as needed to handle messaging traffic and provide fault tolerance.
Load balancing is achieved with round-robin DNS or with layer 4 switching. Round-robin DNS provides load balancing by resolving one host name to multiple IP addresses. The IP addresses returned by the DNS server are rotated so that no single IP address is preferred. The one disadvantage is that round-robin DNS cannot detect whether an IP address has stopped responding; therefore, it does not provide fault tolerance.
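The rotation behavior of round-robin DNS can be illustrated with a small simulation. This is only a sketch: the RoundRobinResolver class, the host name mail.example.com, and the 192.0.2.x addresses are hypothetical, and a real DNS server implements rotation internally.

```python
from collections import deque


class RoundRobinResolver:
    """Toy resolver that rotates the address list after every query."""

    def __init__(self, records):
        # Map each host name to a rotating queue of its IP addresses.
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        addrs = self._records[name]
        answer = list(addrs)   # current ordering is returned to the client
        addrs.rotate(-1)       # the next query sees a different first address
        return answer


resolver = RoundRobinResolver(
    {"mail.example.com": ["192.0.2.10", "192.0.2.11", "192.0.2.12"]}
)
# Clients normally connect to the first address in the answer.
first_choices = [resolver.resolve("mail.example.com")[0] for _ in range(4)]
print(first_choices)
```

Note that the resolver never checks whether an address is reachable, so a failed server keeps receiving its share of clients. This is the limitation described above.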
Layer 4 switching, on the other hand, provides both load balancing and fault tolerance. Like round-robin DNS, layer 4 switching distributes the traffic sent to one IP address to multiple IP addresses. However, it can also recognize when one IP address is not responding and automatically direct client requests to the remaining servers.
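The difference from round-robin DNS is the health check. The following sketch shows the idea under simplifying assumptions: the Layer4Balancer class and the is_alive callback are hypothetical, and an actual layer 4 switch performs its probes and forwarding in hardware or firmware rather than in application code.

```python
import itertools


class Layer4Balancer:
    """Toy balancer: rotates through backends and skips any that fail a probe."""

    def __init__(self, backends, is_alive):
        self._backends = list(backends)
        self._is_alive = is_alive          # callable: backend -> bool
        self._counter = itertools.count()  # drives the rotation

    def pick(self):
        n = len(self._backends)
        start = next(self._counter) % n
        # Try each backend once, beginning at the rotation point.
        for offset in range(n):
            candidate = self._backends[(start + offset) % n]
            if self._is_alive(candidate):
                return candidate
        raise RuntimeError("no backend is responding")


down = {"192.0.2.11"}  # pretend this server has failed
balancer = Layer4Balancer(
    ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    is_alive=lambda backend: backend not in down,
)
picks = [balancer.pick() for _ in range(4)]
print(picks)
```

Requests that would have gone to the failed address are sent to the next responding server instead.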
IMPORTANT: If you are using the Modular Web Agent behind a layer 4 switch, the switch must also guarantee that all requests coming from a single IP address are always redirected to the same server. This is necessary because the Modular Web Agent maintains session information on the server. (The POP and IMAP Agents do not have this restriction.)
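One common way to obtain this kind of source-address persistence is to derive the server choice from a hash of the client IP address. The sketch below assumes that approach purely for illustration; the backend_for_client function is hypothetical, and your switch might implement persistence differently (for example, with a connection table).

```python
import hashlib


def backend_for_client(client_ip, backends):
    """Map a client IP address to one backend, deterministically."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    # Use the first 8 bytes of the digest as an integer index.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


web_servers = ["192.0.2.20", "192.0.2.21", "192.0.2.22"]
first = backend_for_client("198.51.100.7", web_servers)
second = backend_for_client("198.51.100.7", web_servers)
print(first == second)  # the same client always reaches the same server
```

With a plain hash such as this one, changing the list of servers remaps many clients, and those users lose their Modular Web Agent sessions.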
Fault tolerance is built into the SMTP protocol. Consequently, layer 4 switching is not required to provide load balancing and fault tolerance on servers used to exchange mail with other mail systems. Instead, load balancing and fault tolerance are provided by publishing multiple MX records in DNS, one for each SMTP server. When all MX records have the same preference value, incoming mail is automatically distributed across the servers. If one of the servers goes down, the SMTP protocol directs the sending server to try the other MX records before giving up.
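The sending side of this behavior can be sketched as follows. The code is an illustration only: the function names, the try_host callback, and the smtpN.example.com host names are hypothetical, and a real mail transfer agent also handles retries, queuing, and timeouts.

```python
import random


def delivery_order(mx_records, rng=random):
    """Order MX hosts by preference, shuffling hosts that share a preference."""
    by_preference = {}
    for preference, host in mx_records:
        by_preference.setdefault(preference, []).append(host)
    order = []
    for preference in sorted(by_preference):  # lowest value is tried first
        hosts = by_preference[preference][:]
        rng.shuffle(hosts)                    # spreads load across equal hosts
        order.extend(hosts)
    return order


def deliver(mx_records, try_host, rng=random):
    """Return the first host that accepts the message, or None if all fail."""
    for host in delivery_order(mx_records, rng):
        if try_host(host):
            return host
    return None


records = [
    (10, "smtp1.example.com"),
    (10, "smtp2.example.com"),
    (10, "smtp3.example.com"),
]
# Pretend smtp1 is down; the message still reaches one of the other servers.
accepted_by = deliver(records, try_host=lambda h: h != "smtp1.example.com")
print(accepted_by)
```

Because all three records share the preference value 10, each sender picks among them in random order, which distributes incoming mail without any switch in front of the servers.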
Running SMTP services on one or more dedicated servers also insulates the messaging system from spam storms. In a spam storm, the messaging system is suddenly besieged by hundreds, if not thousands, of relayed messages and their bounced returns. If the SMTP Agent and its queue server do not reside on the same server as the client agents or the NMAP Agents responsible for the mail store, users will not even know when the system is under a heavy load.
The message store is the only NetMail component that cannot be cloned at the application level. Because only one NMAP Agent can service a given user context and its associated mailboxes, hardware-level clustering is required for the message store to fail over to another server in the event of a failure, so that users can still retrieve their mail.
IMPORTANT: Configuring more than one NMAP server to service the same user contexts is not allowed and produces unpredictable behavior in the NetMail system.
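When you plan which NMAP server services which user contexts, a simple check for duplicates can catch this mistake before it is deployed. The sketch below works on a hypothetical planning table (a mapping of server names to context names); it does not read NetMail or eDirectory configuration, and the server and context names are invented for the example.

```python
def find_conflicting_contexts(assignments):
    """Return contexts claimed by more than one NMAP server.

    `assignments` maps a server name to the list of user contexts it services.
    """
    owners = {}
    for server, contexts in assignments.items():
        for context in contexts:
            owners.setdefault(context, []).append(server)
    return {ctx: sorted(srvs) for ctx, srvs in owners.items() if len(srvs) > 1}


plan = {
    "nmap1": ["ou=sales,o=acme", "ou=eng,o=acme"],
    "nmap2": ["ou=eng,o=acme"],  # conflict: also assigned to nmap1
}
conflicts = find_conflicting_contexts(plan)
print(conflicts)
```

An empty result means that every context in the plan has exactly one NMAP server.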
Hardware-level clustering consists of shared storage and hardware failover. Because of the disk space and hardware requirements, hardware-level clustering is typically very expensive.
Because of their analog and moving parts, the power supply and disk drives are the most common hardware elements to fail. Consequently, to provide fault tolerance, it is typically sufficient to have a redundant power supply and a Redundant Array of Independent Disks (RAID). Although it might be ideal to have a second server, make sure your drives and power supply are redundant before investing in a backup server.
For NetWare servers running NetMail, we recommend RAID level 1 for disk drive redundancy and failover support. Although RAID 1 uses 50% of your disk capacity for mirroring, it provides the highest level of performance.
RAID 5 gives you the same level of fault tolerance while using a smaller percentage of disk space for redundancy, but it is not as fast as RAID 1. Therefore, because performance is the determining factor in most messaging systems (and because disk space is inexpensive), RAID 1 is the better choice.
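The disk space trade-off can be made concrete with a short calculation. This sketch assumes two-way mirroring for RAID 1 (disks arranged in mirrored pairs) and a single disk's worth of parity for RAID 5; the function names and the sample disk sizes are illustrative only.

```python
def raid1_usable_gb(disk_count, disk_size_gb):
    """Usable capacity with two-way mirroring: half of the raw capacity."""
    if disk_count < 2 or disk_count % 2 != 0:
        raise ValueError("mirroring needs an even number of disks (at least 2)")
    return (disk_count // 2) * disk_size_gb


def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity with RAID 5: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (disk_count - 1) * disk_size_gb


# Four 100 GB disks: 400 GB raw capacity in both cases.
print(raid1_usable_gb(4, 100))  # 200 GB usable, 50% used for mirroring
print(raid5_usable_gb(4, 100))  # 300 GB usable, 25% used for parity
```

The calculation covers capacity only; it says nothing about the performance difference that motivates the RAID 1 recommendation.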
For information on configuring NetMail to take advantage of hardware-level clustering, see Configuring NetMail to Use Novell Cluster Services.