In the previous blogs I covered some options for moving toward a common topology for managing servers by breaking up the management feeds into categories such as network related, help desk tickets, performance related, etc. I showed an example of how you can automate the building of these views so they are data driven and self maintaining. The next step is to put rules in place that control how state (Critical, Major, Minor, etc.) is propagated up from the different categories and how it impacts the individual servers.
Novell Operations Center controls state propagation through a technology we refer to as Algorithms. Algorithms give the administrator a way to describe different types of scenarios in an XML file and instruct the system on how to propagate state. One common example is a cluster. Take a cluster with, say, ten active (hot) nodes and users load balanced across them. If one of those nodes were to go down, the underlying management tools would issue a critical alarm. While this is critical, the load balancers will automatically move users from the down node to other nodes, so the service is still up and running. With the standard out of the box rules, any critical alarm and/or critical child element automatically floats its highest condition up the topology, which would turn the service Critical/Red. For this use case, we would want to show the service as OK/Green for service availability. We may also want to put a rule (Algorithm) in place that says if a certain percentage of those nodes are offline, such as 40%, the service should show Minor/Yellow.
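To make the difference concrete, here is a minimal Python sketch that compares the default "highest condition floats up" behavior with the percentage style cluster rule. This is not NOC code: the function names, the severity list and the "at least 40%" reading of the threshold are my own assumptions for illustration.

```python
# Severity order, lowest to highest (condition names follow this post).
SEVERITY = ["OK", "MINOR", "MAJOR", "CRITICAL"]


def highest_condition(children):
    """Default behavior: the worst child condition floats up."""
    return max(children, key=SEVERITY.index, default="OK")


def cluster_condition(children, threshold_pct=40):
    """Hypothetical cluster rule: show MINOR once at least
    threshold_pct percent of the nodes are CRITICAL, otherwise OK."""
    if not children:
        return "OK"
    down = sum(1 for c in children if c == "CRITICAL")
    # Integer arithmetic avoids floating point surprises at the boundary.
    if down * 100 >= threshold_pct * len(children):
        return "MINOR"
    return "OK"


# One node down out of ten.
nodes = ["CRITICAL"] + ["OK"] * 9
print(highest_condition(nodes))  # default propagation
print(cluster_condition(nodes))  # cluster rule
```

With one of ten nodes down, the default propagation reports the service as Critical while the cluster rule keeps it OK. Once four of the ten are down, the cluster rule moves to Minor.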
For this blog I will continue with the use case of an individual, server by server rule that controls how the underlying management metrics (alarms, KPIs, etc.) are propagated. A cluster rule is another layer that would be applied across multiple servers providing similar processing to a higher level service.
For our example, we have Network, Performance, Backup, Help Desk, Change Management and Process. OK, so I have a few more child containers than in the previous blogs, but I'm sure we are all fine with that. Now that we have the categories, let's go over the high level rules we want to set up.
Backups: While backups are important, a backup that fails to run in the middle of the night or completes with errors does not have an impact on the running service, even though it is interesting and someone needs to work on it. For our rule, we want backup related alerts coming in, but we do not want those alerts to drive the condition of the server.
Network: Our network monitoring tools are pretty good. We are confident that when a networking related alarm appears (e.g., host not pingable), there is more than likely a server outage going on. For any critical network related alarms, we plan on making the server Critical/Red.
Process: We have agents installed on some of our critical servers to monitor the server's components (CPU, memory, etc.) and log files for errors, as well as some specific modules designed to talk directly to the running process. We already have views set up in NOC for the application owners that roll up the alarms into buckets based on ownership. For the server view, critical errors that show up under the Process bucket do not typically need to mark the server Critical/Red. The preference of our server monitoring team is to mark the server Minor if the Process bucket is Critical, since the application owners are already managing their critical items. We want to raise Operations' attention, but since it is not an outage, we want to make sure they are working on real outages first.
Performance: For performance related alarms, our instrumentation and/or synthetic testing is implemented to measure response times and alarm when boundaries are exceeded. Our applications routinely exceed thresholds, but the application is typically up and running. For this rule and for our environment, we would like the server to turn Major/Orange when there are performance related issues. (The use case is that they are working on improving their performance monitoring, but are putting controls in place to manage it through process in the short term, i.e., address outages/red first, then work through the next level of problems, orange, then yellow.)
Change Management: This is more of a useful feed to provide to Operations for when they are troubleshooting or performing triage of problems within the environment. There are situations where an identified problem is already known and there is a change request already opened against the server to address it during the next maintenance window. Change requests should not drive the condition of the server, but the existence of a change request that appears related to an alarm should in turn help the Operator focus on other important issues.
Help Desk: Our Help Desk team has great processes in place to categorize each individual ticket with an accurate severity. When a ticket is marked Critical, typically there is an issue going on that needs to be addressed right away, but unless monitoring notices an outage, help desk related outages are considered by default to be user specific. We have monitoring in place that can determine outages by individual server, specific router or remote site. Having active Help Desk tickets in the view is helpful to the Operator, but they should not drive the state of the server. If users are opening tickets about not being able to log in to an application and the monitoring team notices performance related issues, the two may be related.
Algorithms are stored as an XML file under the base install directory, in database/shadowed/Algorithms.xml. This file is monitored for changes and re-read periodically, and it is also stored in the backend database. A line is written to the trace file when changes are noticed, which is good feedback when you are testing edits to individual rules. There is a UI to edit the Algorithms, which can be accessed under Administration/Server/Algorithms (right-click and choose Edit Algorithms), but most administrators prefer to edit the file directly... up to you.
Below is the algorithm based on the rules above. It starts with a tag that names the algorithm; this is the name that is selectable from within the console interface. This algo is called "Server Rule".
The next section does a gather, which is how the children of the server are collected. Since children can be direct/real children (NAM) or, in other use cases, linked children (ORG), we are gathering both. For our specific use case, gathering on just NAM or just ORG would not produce different results.
From there we go into a split/branch. Think of this like a case statement: only one of the branches will apply.
The first branch reduces the list of children (Network, Performance, Help Desk, etc.) down to *just* the Network child. A test is then performed on the condition of the Network category: if it is Critical (testCondition), we want to float a Critical state up to the server (result) and put a note on the server (reason) that identifies a network related issue.
The next branch, for Performance, follows the same general pattern, but a Critical Performance status is elevated to the server as a Major condition instead. Process, the next branch, elevates a Minor condition when Process is Critical.
The last branch is more of a catch all, kind of like an "else" statement. In this section no conditions, properties or other things are tested; we just default the server to an OK/Green status.
<algorithm name="Server Rule">
  <exec command="gather" relationship="NAM" />
  <exec command="gather" relationship="ORG" />
  <split>
    <branch>
      <exec command="reduce" invert="yes" property="name" value="Network" />
      <exec command="band" testCondition="CRITICAL" amount="100%" result="CRITICAL" reason="Network related issue identified" />
    </branch>
    <branch>
      <exec command="reduce" invert="yes" property="name" value="Performance" />
      <exec command="band" testCondition="CRITICAL" amount="100%" result="MAJOR" reason="Performance impact identified" />
    </branch>
    <branch>
      <exec command="reduce" invert="yes" property="name" value="Process" />
      <exec command="band" testCondition="CRITICAL" amount="100%" result="MINOR" reason="Process issue identified" />
    </branch>
    <branch><set result="OK" reason="Online" /></branch>
  </split>
</algorithm>
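If you want to sanity check the branch order before touching a live server, the split/branch logic described above can be modelled in a few lines of Python. This is only a sketch of my reading of the rule, not how NOC evaluates algorithms internally. The `server_rule` function and the dictionary of category conditions are hypothetical helpers. The results and reasons are copied from the algorithm.

```python
def server_rule(children):
    """Model of the "Server Rule" algorithm.

    children maps a category name to its current condition,
    e.g. {"Network": "OK", "Backup": "CRITICAL"}.
    Returns a (result, reason) pair for the server.
    """
    # Each tuple mirrors one branch: (category, result, reason).
    # Order matters: the first branch that matches wins.
    branches = [
        ("Network", "CRITICAL", "Network related issue identified"),
        ("Performance", "MAJOR", "Performance impact identified"),
        ("Process", "MINOR", "Process issue identified"),
    ]
    for category, result, reason in branches:
        if children.get(category) == "CRITICAL":
            return result, reason
    # Catch all branch: nothing above matched.
    return "OK", "Online"


# Backup, Help Desk and Change Management never drive the server state.
print(server_rule({"Backup": "CRITICAL", "Help Desk": "CRITICAL"}))
# Network outranks Performance when both are Critical.
print(server_rule({"Network": "CRITICAL", "Performance": "CRITICAL"}))
```

Because Backup, Help Desk and Change Management have no branch of their own, a Critical condition in any of them falls through to the catch all and the server stays OK/Green, which is what the rules above asked for.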
Just a couple of things on algorithms. Algos should be generalized so you can write one algo and use it many times. One practice is to write an algo per class of element, such as router, server, database, etc. There are use case specific situations where you may want a different algorithm, but many times we try to address those situations within the tree/topology like we did for this blog.
Algos can do more than I covered. There is even the ability to jump right into JavaScript and do all kinds of crazy stuff. Just be careful: every time there is an alarm update for an underlying element, it causes the parents to recalculate their conditions. Having JavaScript in an algorithm is acceptable, but it may have a performance impact.
The last piece of this series is to assign the Server Rule algorithm to our servers. Since we are driving for an automated build and ongoing update of our server view, we are going to use Service Configuration Manager. Within our existing Service Configuration Definition that built the server view, there is an Algorithm section under Modeling Policies. The next step is to set up a new Algorithm Policy. Since we have a predictive class for our servers (if I remember correctly, I used server_host), within the Name Matcher for this algorithm, remove the .* expression and set up a class match for server_host, then select the Server Rule algorithm in the drop down. Whenever a new server is added to the view, the Server Rule algorithm will automatically be applied by default.
This concludes the series on a common approach to monitoring and portraying servers within Novell Operations Center. There are many ways this can be accomplished. This blog was intended to show one approach that I have seen work well for several customers.
Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).
It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.