Multiprocessor support in NetWare
is not brand new with NetWare 6, and, in fact, the majority of MPK
features in NetWare 6 were available in NetWare 5 and some date
back to NetWare 4. Most improvements in NetWare 6 multiprocessor
support come from outside the MPK, because many NetWare functions are
newly multiprocessor enabled.
Critical NetWare 6 MP Components
Plainly, multiprocessor support includes quite a bit more than
a few kernel modules to recognize and use more than one processor
in the server. Many software components inside NetWare and from
application developers must work together to increase server performance
through the use of multiple processors.
Scheduler
The traffic cop inside the MPK, the Scheduler uses the MPS 1.4
Platform Support Module (MPS14.PSM) to determine the number of processors
in a server during installation. By watching processor activity,
the Scheduler decides where to send new threads requesting execution.
Programs that do not explicitly use multiple processors but are
deemed MP safe are spread among the available processors.
Depending on how many threads they use and whether the developer
used serialization techniques, any processor can handle the threads.
Every program in a single processor system runs on Processor 0 because
the zero processor is the only one there. Programs not MP safe automatically
run on Processor 0.
Some developers explicitly request certain processors in the program
itself. This falls outside the realm of smart programming techniques,
but it is done now and then. Typically, these programs request Processor
0 even when MP-enabled. No matter how strongly operating system vendors
warn that this technique can cause problems, developers still
do it occasionally, so NetWare supports it.
The Scheduler, after dealing with exceptions, takes each new thread
as it appears and allocates that thread to the first idle processor.
Processor 2 not doing anything? The thread goes there.
Once a thread runs on a particular processor, the Scheduler tries
hard to keep that thread on that same processor for reasons of efficiency.
There are two primary exception states where the Scheduler moves
a thread:
A thread not MP-enabled gets moved to Processor 0.
The processor load gets far out of balance.
Two other situations, both rare, can also cause the Scheduler to move
a thread. If a processor is stopped by a management command, its
threads must obviously move to other processors. Threads that specify
processors by number will also be moved.
When an MP-enabled program runs, here's how NetWare 6 handles it:
Scheduler checks to see which (if any) processors are idle
Scheduler sends the waiting thread to the lowest-numbered idle
processor
Scheduler repeats the process with each waiting thread
Returning thread requests stay on the processor where they started
if at all possible
"Processor affinity" is the term for the technique to keep threads
on the same processor whenever possible. Unless one of the two exceptions
occurs, the Scheduler follows the processor affinity rule and leaves
threads alone to execute on their particular processor.
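The dispatch steps and the affinity rule above can be illustrated with a short Python sketch. This is not NetWare code: the function name dispatch, the last_cpu mapping, and the idle set are hypothetical names chosen for illustration, and returning None when no processor is idle is an assumption, since the text does not say what happens in that case.

```python
def dispatch(thread_id, last_cpu, idle):
    """Pick a processor for a ready thread (illustrative sketch only).

    last_cpu: dict mapping thread_id -> processor the thread last ran on
    idle:     set of processor numbers that are currently idle
    Returns the chosen processor number, or None if the thread is new
    and no processor is idle (an assumption made for this sketch).
    """
    previous = last_cpu.get(thread_id)
    if previous is not None:
        # Returning thread: processor affinity keeps it where it started.
        chosen = previous
    elif idle:
        # New thread: send it to the lowest-numbered idle processor.
        chosen = min(idle)
    else:
        return None
    idle.discard(chosen)          # the chosen processor is now busy
    last_cpu[thread_id] = chosen  # remember the placement for next time
    return chosen


last_cpu = {}
idle = {0, 1, 2, 3}
first = dispatch("a", last_cpu, idle)   # new thread, lowest idle processor
second = dispatch("b", last_cpu, idle)  # next new thread, next idle processor
again = dispatch("a", last_cpu, idle)   # returning thread keeps its processor
```

The sketch ignores the exception cases (funneling and load balancing), which the following sections describe.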
Funneling
The fancy (er, official) name for moving non-SMP programs to Processor
0 is funneling. If a thread from an older application gets assigned
to Processor 1 or above by some chance, such as not identifying
itself as non-SMP and not appearing to be a legacy application,
the funneling process within the Scheduler takes over and moves
the thread to Processor 0.
Threads which get funneled do so because:
The thread is in an MP state
The thread is executing MP enabled code
The thread calls an MP-unsafe procedure
When the above conditions occur, the Scheduler will funnel the
thread to Processor 0. Once the thread finishes the MP-unsafe procedure,
Scheduler will return the thread to the original processor.
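The move to Processor 0 and back can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the Scheduler's implementation: the Thread class, the funneled context manager, and the mp_safe flag are all hypothetical names introduced here.

```python
from contextlib import contextmanager


class Thread:
    """Toy thread record that tracks which processor it is on."""

    def __init__(self, name, processor):
        self.name = name
        self.processor = processor
        self.history = [processor]  # every processor the thread has visited

    def move_to(self, processor):
        if processor != self.processor:
            self.processor = processor
            self.history.append(processor)


@contextmanager
def funneled(thread):
    """Run the enclosed work on Processor 0, then return the thread."""
    original = thread.processor
    thread.move_to(0)
    try:
        yield thread
    finally:
        # Once the MP-unsafe work finishes, go back to the original processor.
        thread.move_to(original)


def call_procedure(thread, procedure, mp_safe):
    """Call a procedure, funneling the thread first if it is MP-unsafe."""
    if mp_safe:
        return procedure(thread)
    with funneled(thread):
        return procedure(thread)


worker = Thread("worker", processor=2)
ran_on = call_procedure(worker, lambda t: t.processor, mp_safe=False)
```

In this model the MP-unsafe procedure observes Processor 0 while it runs, and the thread ends up back on Processor 2 afterward.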
Run Queues
A run queue, a data structure inside the operating system kernel,
holds threads ready for processing. Uniprocessor systems have a
single run queue, since they have a single processor.
Multiprocessor systems demand a new way to organize threads in
a state of readiness, and two options lead the way: a single global
run queue or per-processor run queues.
A global run queue provides a single run queue that distributes
ready threads across all processors. Since the global run queue
always has threads ready to process, no processor stays idle for
long. Unfortunately, as the number of processors increases, the global
run queue itself can become a system bottleneck.
Per-processor run queues maximize throughput per processor by
taking advantage of each thread's processor affinity. Threads
almost always run on the same processor they ran previously, keeping
high speed cache information for the thread close at hand. No single
queue blocks access to all processors, eliminating the bottleneck
possibility of a global run queue.
On the other hand, per-processor run queues must have some overhead
built in to maintain load balancing. A single processor's run queue
can become heavily loaded, but the per-processor run queue can't
itself compare its load with that of other processors. An outside
mechanism (like Novell's Scheduler) must help the balancing remain
distributed.
NetWare's kernel uses the per-processor run queue, one reason NetWare
multiprocessor systems scale so well. Each processor picks up waiting
threads from its own processor run queue, allowing each added processor
to provide more total system horsepower. Yet some outside mechanism
must help load balancing to maintain the performance increase with
each processor.
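As a rough illustration of the per-processor design, the Python sketch below gives each processor its own queue of ready threads. The class and method names are hypothetical, and the real kernel structure is certainly more involved. Note that a queue only knows its own length; comparing loads across processors takes an outside caller, which matches the need for a separate balancing mechanism.

```python
from collections import deque


class PerProcessorRunQueues:
    """One ready-thread queue per processor (illustrative sketch)."""

    def __init__(self, processor_count):
        self.queues = [deque() for _ in range(processor_count)]

    def enqueue(self, thread_id, processor):
        """Mark a thread as ready to run on a specific processor."""
        self.queues[processor].append(thread_id)

    def next_thread(self, processor):
        """A processor takes work only from its own queue, never a shared one."""
        queue = self.queues[processor]
        return queue.popleft() if queue else None

    def loads(self):
        """Queue lengths, for an outside balancer to compare."""
        return [len(queue) for queue in self.queues]


run_queues = PerProcessorRunQueues(processor_count=2)
run_queues.enqueue("a", 0)
run_queues.enqueue("b", 0)
run_queues.enqueue("c", 1)
picked = run_queues.next_thread(1)  # Processor 1 takes from its own queue
```

After this run, Processor 1 has drained its queue while Processor 0 still holds two ready threads, the kind of imbalance that load balancing has to correct.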
Load Balancing
NetWare uses a sophisticated load balancing algorithm to maintain
relatively equal performance across multiple processors. The two
critical components of any balancing scheme are the ability to distribute
processing load quickly, yet the stability to not overreact to small
load imbalances.
The Scheduler in NetWare uses a threshold to keep system load
balancing stable. In fact, two thresholds feed information to
the Scheduler: a high trigger load and a low trigger load. Together,
these two measurements strike a balance between processor inactivity
and excessive thread migration.
Periodically, the NetWare Scheduler calculates the system-wide
load and the mean load (mid-point between the highest and lowest
loads). This calculation result gets applied to each individual
processor to determine if that processor is underloaded or overloaded.
The threshold margin maintains system productivity by allowing some
leeway before thread migration.
Note: although the threshold can be changed in the NetWare Remote
Manager, Novell engineers strongly recommend against making any
changes. If you must make changes, note the optimum value so you
can reset the system when you realize Novell engineers give good
advice which should be heeded.
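The threshold check described in this section can be sketched in Python. The exact formula NetWare uses is not given here, so this sketch assumes the high and low triggers sit one margin above and below the mean load, takes the mean as the midpoint between the highest and lowest loads as described above, and measures load as a simple count of ready threads. The function name and the margin parameter are hypothetical.

```python
def classify_processors(loads, margin):
    """Label each processor as overloaded, underloaded, or balanced.

    loads:  per-processor load figures (assumed here to be thread counts)
    margin: leeway allowed around the mean before a processor is flagged
    """
    mean_load = (max(loads) + min(loads)) / 2  # midpoint of highest and lowest
    high_trigger = mean_load + margin          # assumed form of the high trigger
    low_trigger = mean_load - margin           # assumed form of the low trigger
    labels = []
    for load in loads:
        if load > high_trigger:
            labels.append("overloaded")
        elif load < low_trigger:
            labels.append("underloaded")
        else:
            labels.append("balanced")          # inside the margin: leave it alone
    return labels


labels = classify_processors([10, 4, 6, 2], margin=3)
```

With loads of 10, 4, 6, and 2 and a margin of 3, the midpoint is 6, so only the processors above 9 or below 3 are flagged. Small differences inside the margin trigger no thread migration, which is the stability the thresholds are meant to provide.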
Never Enough Cache
Memory vendors make new memory chips faster all the time, but no
external memory chips can process data as fast as memory built into
the processor itself: onboard cache memory. Running at the same
speed as the CPU, and with no delays for off-chip I/O, onboard cache
truly blazes new speed records.
Why does processor affinity receive so much attention from NetWare
engineers? To utilize onboard cache, of course.
Cache misses occur when the Scheduler forces a thread to migrate
from one processor to another. This forces a cache flush, where
the data needed by the migrated thread must be written out of the
first processor (flushed) into system RAM.
The new processor executing the thread then reads system RAM for
the thread data. As you can guess, performance engineers groan when
calculating the drop in thread performance speed with every cache
miss.
There are three types of cache:
L1 (Level 1): internal to the processor chip core and just as
fast as the processor itself
L2 (Level 2): external to the processor chip core, yet often inside
the processor chip housing (or cartridge), and almost as fast as
the processor
L3 (Level 3): typically external to the processor chip and chip
housing (or cartridge)
Processors with large L1 and L2 caches cost quite a bit more money
than processors with smaller onboard caches. Where a Pentium* III
chip may have an L2 cache of 256KB, the same speed Pentium III XEON
processor may have 1MB of onboard cache. Now you understand why
servers with XEON or other high-cache processors cost so much more,
and why a cheap server won't perform as well under load as one of
the servers with larger processor caches.
Data in processor cache must always be written out to system RAM
sooner or later, of course, so other processors can take advantage
of the data if necessary, and to keep the system current and data
in balance. NetWare 6 uses a lazy-write algorithm for normal cache
data copies to RAM. When the cache management circuits realize the
cache has no more room for more data, the system writes the information
out to RAM. This puts all the modified data out where all other
processors can use the data, but on the processor's terms, not when
forced by a cache miss.
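The lazy-write idea can be modeled in a few lines of Python. This is a general write-back illustration, not a description of the actual cache circuitry: the class name, the dictionary standing in for system RAM, and the choice to write out the oldest entry first are all assumptions made for the sketch.

```python
from collections import OrderedDict


class LazyWriteCache:
    """Toy write-back cache: data reaches RAM only when the cache is full."""

    def __init__(self, capacity, ram):
        self.capacity = capacity
        self.ram = ram              # dict standing in for system RAM
        self.lines = OrderedDict()  # cached data, oldest entry first

    def write(self, address, value):
        if address not in self.lines and len(self.lines) >= self.capacity:
            # No room left: write the oldest entry out to RAM to make space.
            old_address, old_value = self.lines.popitem(last=False)
            self.ram[old_address] = old_value
        self.lines[address] = value
        self.lines.move_to_end(address)  # most recently written goes last

    def flush(self):
        """Forced write of everything, as when a thread migrates."""
        while self.lines:
            address, value = self.lines.popitem(last=False)
            self.ram[address] = value


ram = {}
cache = LazyWriteCache(capacity=2, ram=ram)
cache.write("x", 1)
cache.write("y", 2)
ram_before = dict(ram)  # nothing has been written out yet
cache.write("z", 3)     # cache is full, so "x" goes out to RAM
```

The contrast between write and flush mirrors the text: lazy writes happen one entry at a time when space runs out, while a migration forces everything out at once.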
Improvements Since NetWare 5.1 MP Support
There are improvements in the MPK between NetWare 5.1 and NetWare
6, but no tremendous leap of innovation. The biggest improvements
in multiprocessor support came between NetWare 4.11 and NetWare
5, when the entire MPK upgraded considerably for the newer, more
powerful processors available and new motherboards to support them.
NetWare 6 MP Enabled Components
Since NetWare 5, the multiprocessor engineers at Novell have been
busy upgrading critical server functions to better utilize multiprocessor
servers. The list of MP-enabled components may surprise you. There
are so many we need to group them:
Specialized Servers and Critical Components
NDS® eDirectory™
Novell JVM (Java* Virtual Machine)
Search engine
Web engine
Servlet interface in NetWare Enterprise Server
Protocol Stacks
NetWare Core Protocol™ (of course)
TCP/IP (complete IP stack family)
HTTP
WebDAV (Web-based Distributed Authoring and Versioning)
LDAP (Lightweight Directory Access Protocol)
SLP2 (Service Location Protocol)
Gigabit Ethernet, 100 Megabit Ethernet, 10 Megabit Ethernet
Token Ring 16
NNTP (NetWare News Server running Network News Transfer Protocol)
Storage and Data Transfer
NSS (Novell Storage Services™)
DFS (Distributed File Services)
Fibre Channel disk support
Transport service request dispatcher
Protocol service request dispatcher
Security Features
Authentication
NICI (Novell International Cryptographic Infrastructure)
GUI Audit (new ConsoleOne® snap-in module)
Novell MP-Enabled Products
BorderManager®
GroupWise®
ZENworks® for Desktops
ZENworks for Servers
No others
Different customers will utilize different MP-enabled applications
and utilities, but every customer will benefit by running NetWare
on a multiprocessor server. Throughput, one of the bottlenecks for
servers today, gains a considerable increase with TCP/IP becoming
MP-enabled. Storage services always need more speed, at least according
to users.
With NetWare 5, SMP systems provided performance improvements for
specific applications. NetWare 6 increases performance in many ways
on multiprocessor servers, speeding the entire user experience through
improved MP-enabled functions within NetWare.