Troubleshooting ndsd becoming unresponsive on OES Linux

  • 7007332
  • 07-Dec-2010
  • 18-Dec-2013

Environment

Novell eDirectory 8.8 for Linux
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 1
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 2
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 3
Novell Open Enterprise Server 11 (OES 11) Linux 
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 1

Situation

When troubleshooting issues with the ndsd process, there are more things to consider than just eDirectory.

 Here is a list of some of the pieces that could also have an effect on eDirectory:  


  • Novell Core Protocol (NCP)
  • Novell Storage Services (NSS)
  • Linux User Management (LUM)
  • ZenWorks
  • Novell Modular Authentication Service (NMAS)
  • LDAP
  • Novell Auditing (lcache)

 Symptoms:

Can't login to server
Slow login times to server
Can't map drives to the server
Ndsd daemon is running but no activity in ndstrace.
Ndsd process pegged at high cpu utilization displayed in top.

Resolution

1.  Apply the latest patches from the channel.

First step is to make sure the server has the latest patches. Tid 3426981 is a great reference which lists all fixes that have been included in each major and minor version of eDirectory. Check it to see if your issue has already been addressed.

 

 

2.  Use supportconfig

 

You can use supportconfig to upload the config bundle and this data will automatically be processed by Novell's Support Advisor server.  The report will then be automatically attached to the Service Request. 


Use: "supportconfig -ur SR#"

This report can be viewed through Novell's Customer Care Center portal.
 
3.Consult the Novell eDirectory Tuning Guide for UNIX* Platforms for guidance on how to optimally tune eDirectory.

This guide provides information on tuning your eDirectory environment for improved performance.
This guide is broken up into the following sections:

eDirectory Subsystems - describes primary subsystems that influence eDirectory performance.
Analyzing System Bottlenecks - describes various system resources and their influence on eDirectory performance.
Tuning eDirectory Subsystems -  describes how to analyze and tune eDirectory under various conditions and environments.
The eDirectory Configuration - describes how to configure various tunable parameters.
 
4.  While in high utilization, grab some ps and gstack output of the ndsd process.

Gstack will dump out stack traces of each thread under the ndsd process.  This can provide insight on what might be causing the utilization.  Please refer to Tid 70071838 for more information.

 
In addition to gstack, the “ps†utility can be very helpful in isolating the “causeâ€.  It’s best to grab each of these as concurrently as possible.  The following ps options have been found to be most helpful:

 

ps -C ndsd -L -o pid,tid,nlwp,pcpu,pmem,vsz,stat

 

-C allows you to specify process by command name.

-L will include the threads under ndsd (LWP = light weight process = child threads)

-o allows custom display of desired info.

pid = process id

tid = thread id

nlwp = number of threads under the process

pcpu = percent of the cpu the process is using

pmem = percent of the total memory the process is using

vsz = virtual memory size of the process in kilobytes

stat = current state of the process (for more info, “man ps†and search for  “PROCESS STATEâ€)

 

See additional information section below for more information and a sample script that could be used to gather this and other relevant data..

 

  

5.  Force a core dump of the ndsd process for NTS to analyze.

 

Along the same lines as the stack traces, it may be useful to force a core of the ndsd process. The core contains more information that can be useful to NTS. The gcore script is a part of gdb on Suse and can be used to accomplish this. Gcore will not halt ndsd or cause ndsd to become less responsive than it already is. The gcore utility can be used as follows: 

 

gcore `pgrep ndsd` (or you can find the pid # for ndsd using ps –eaf | grep ndsd

This will generate a core file in the current directory.

 

Note: There have been some cases with certain versions of gdb and the kernel that may not provide a useful core. See Tid 7004715&sliceId=1 for more information. At the very least grab the stack traces as the gstack utility does note have this issue.

 

 

6.  Try to narrow the scope.

 

You can prevent some components of ndsd from auto loading on startup by modifying the ndsmodules.conf. This file is typically located at /etc/opt/novell/eDirectory/conf/ndsmodules.conf. Some of the components can not be remarked out as they are necessary for ndsd to function. These ones are labeled with system,fail in the second column. The following are some modules that you do not want to remark out: 

 

dhlog auto,fail #DHost logger
ncpengine auto,system,fail #Core NCP Services
dsloader auto,system,fail #Loader
masv auto,system,fail #Modular Authentication Services
nds auto,system,fail #Core DS Services

Here are some components which could be remarked out in order to see if utilization or hang persists (and still retain basic eDirectory functionality.

 

#httpstk auto #DHost HTTP Stack
#hconserv auto #HConServ
#nldap auto #LDAP Server
#imon auto #iMon
#embox auto #eMBox
#pkiserver auto #PKI server
#ssncp auto #SecretStore 

This methodogy would typically be used when the hang / utilization happens pretty quickly after startup. If after making these changes utilization / hang does not occur, start adding the modules back one-by-one until you know the culprit.

 

 

7. Bump up the log level and see what's going on in the ndsd.log.

 

The ndsd.log is typically located in /var/opt/novell/eDirectory/log/ndsd.log.  The debug level for the ndsd.log can be increased by modifying / adding the following variable in the nds.conf file: (typically located in /etc/opt/novell/eDirectory/conf/nds.conf)

 

n4u.server.log-levels=Logxxxx

 

The default level is LogError.  During troubleshooting, it would be a good idea to at least set this level to "LogWarn"  This may help identify problem that wouldn't normally show up in the ndsd.log.  This would then require a restart of ndsd for this change to take effect. 

 

For more information on log levels see Novell's Documentation.

 

 

8.  NCP debugging

 

The ncpengine on OES linux is divided up between filesystem and eDirectory threads. There are a finite number of threads defined for each by default. A new enhancement available in ncpcon can be used to identify the number of threads in use for each. If either thread pool has hit it's ceiling it could lead to a hang or high utilization. Based on this output, the ceiling could be increased to possibly help the situation. For more information see: Tid 7007134.

 

In addition to the thread information it would be a good idea to turn on ncp logging. This can be done by typing "ncpcon log everything". This will write to multiple logs in the /var/opt/novell/log/* directory that could be zipped up and sent to NTS.

 

If nss volumes are being used, you may want to consider tuning as sometimes the default settings are not optimal for all environments and could lead to the same symptoms. Please refer to TID 7004888 for more information on how to do this.

 

 

9. Linux User Management (LUM/namcd)

 

Depending on how LUM is configured, it could lead to issues with utilization, especially if many different OES boxes are pointing to the same preferred server. Then /etc/nam.conf is where this variable is stored. Typically it is best to point “preferred-server=†to it's own IP Address. It may also be a good idea to set “persistent-search from "yes" to "no" as has also causes high utilization in some cases. This variable is also in the /etc/nam.conf . For more information see: Tid 7006086


Check that persistent search is disabled on the ldap server object as well.

Use iManager or ldapconfig to check.  For ldapconfig do:

ldapconfig get ldapEnablePSearch

At the prompt enter the FQN for admin in .x500 example: admin.novell

To set the persistent search to no do the following:

ldapconfig set "ldapEnabledPSearch=no"

 

10.  Novell Modular Authentication Services (NMAS)

 

NMAS is heavily integrated with eDirectory these days. It is used to enforce/enable various login methods and provide enhanced password features such as Universal Password. There have been cases where nmas has caused utilization issues. This will likely effect Netware servers utilization, but worth mentioning. For more info see: Tid 7002047

 

 

11. Audit (lcache)

 

Novell Auditing has been known to cause ndsd to become unresponsive at times. If auditing is running consider taking it out of the picture to see if the problem goes away.

 

 

Additional Information

Gstack and ps output example:
The following is an example of how the information from a gstack and ps output helped identify what was causing ndsd to go into high utilization/become unresponsive on an OES linux server.
Output from ps command:  The bolded line (thread  id 4179) is the problem thread.
PID    TID  NLWP %CPU %MEM    VSZ STAT
4132  4132   86  0.0  2.4 693756 Sl
 4132  4164   86  0.0  2.4 693756 Sl
 4132  4166   86  0.0  2.4 693756 Sl
 4132  4167   86  0.0  2.4 693756 Sl
 4132  4169   86  0.0  2.4 693756 Sl
 4132  4170   86  0.0  2.4 693756 Sl
 4132  4171   86  0.0  2.4 693756 Sl
 4132  4172   86  0.0  2.4 693756 Sl
 4132  4173   86  0.0  2.4 693756 Sl
 4132  4174   86  0.0  2.4 693756 Sl
 4132  4175   86  0.0  2.4 693756 Sl
 4132  4176   86  0.0  2.4 693756 Sl
 4132  4179   86 27.6  2.4 693756 Rl
 4132  4181   86  0.0  2.4 693756 Sl
 4132  4187   86  0.0  2.4 693756 Sl
 4132  4188   86  0.0  2.4 693756 Sl
 4132  4314   86  0.0  2.4 693756 Sl
 4132  4315   86  0.0  2.4 693756 Sl
 4132  4316   86  0.0  2.4 693756 Sl
 4132  4317   86  0.0  2.4 693756 Sl
 4132  4318   86  0.0  2.4 693756 Sl
 4132  4319   86  0.0  2.4 693756 Sl
 4132  4320   86  0.0  2.4 693756 Sl
 4132  4321   86  0.0  2.4 693756 Sl
 4132  4322   86  0.0  2.4 693756 Sl
 4132  4326   86  0.0  2.4 693756 Sl
 4132  4327   86  0.0  2.4 693756 Sl
 4132  4328   86  0.0  2.4 693756 Sl
 4132  4391   86  0.0  2.4 693756 Sl
 4132  4469   86  0.0  2.4 693756 Sl
 4132  4850   86  0.0  2.4 693756 Sl
 4132  4851   86  0.0  2.4 693756 Sl
 4132  5055   86  0.0  2.4 693756 Sl
 4132  5057   86  0.0  2.4 693756 Sl
 4132  5061   86  0.0  2.4 693756 Sl
 4132  5062   86  0.0  2.4 693756 Sl
 4132  5063   86  0.0  2.4 693756 Sl
 4132  5064   86  0.0  2.4 693756 Sl
 4132  5065   86  0.0  2.4 693756 Sl
 4132  5066   86  0.0  2.4 693756 Sl
 4132  5067   86  0.0  2.4 693756 Sl
 4132  5282   86  0.0  2.4 693756 Sl
 4132  5283   86  0.0  2.4 693756 Sl
 4132  6013   86  0.0  2.4 693756 Sl
 4132  6920   86  0.0  2.4 693756 Sl
 4132  7201   86  0.0  2.4 693756 Sl
 4132  8048   86  0.0  2.4 693756 Sl
 4132 22148   86  0.0  2.4 693756 Sl
 4132  9246   86  0.0  2.4 693756 Sl
 4132  7601   86  0.0  2.4 693756 Sl
 4132 18933   86  0.0  2.4 693756 Sl
 4132 19073   86  0.0  2.4 693756 Sl
 4132 19085   86  0.0  2.4 693756 Sl
 4132 19841   86  0.0  2.4 693756 Sl
 4132 21351   86  0.0  2.4 693756 Sl
 4132  2865   86  0.0  2.4 693756 Sl
 4132 11750   86  0.0  2.4 693756 Sl
 4132 11853   86  0.2  2.4 693756 Sl
 4132 15395   86  0.0  2.4 693756 Sl
 4132 30187   86  0.0  2.4 693756 Sl
 4132 30318   86  0.0  2.4 693756 Sl
 4132  3448   86  0.0  2.4 693756 Sl
 4132  3697   86  0.0  2.4 693756 Sl
 4132  8913   86  0.0  2.4 693756 Sl
 4132 15140   86  0.0  2.4 693756 Sl
 4132 20276   86  0.0  2.4 693756 Sl
 4132 20305   86  0.0  2.4 693756 Sl
 4132 21910   86  0.0  2.4 693756 Sl
 4132 21933   86  0.0  2.4 693756 Sl
 4132 23395   86  0.0  2.4 693756 Sl
 4132 24450   86  0.0  2.4 693756 Sl
 4132 26244   86  0.0  2.4 693756 Sl
 4132 26246   86  0.0  2.4 693756 Sl
 4132 26247   86  0.0  2.4 693756 Sl
 4132 26268   86  0.0  2.4 693756 Sl
 4132 26287   86  0.0  2.4 693756 Sl
 4132 26312   86  0.0  2.4 693756 Sl
 4133 26318   86  0.0  2.4 693756 Sl
 4132 26323   86  0.0  2.4 693756 Sl
 4132 26325   86  0.0  2.4 693756 Sl
 4132 27005   86  0.0  2.4 693756 Sl
 4132 27006   86  0.0  2.4 693756 Sl
 4134 27065   86  0.0  2.4 693756 Sl
 4132 27184   86  0.0  2.4 693756 Sl
 4132 27209   86  0.0  2.4 693756 Sl
 4132 27212   86  0.0  2.4 693756 Sl
 
You can see that all the other threads are near 0 percent of the cpu except thread 4179, which is at 27.6 percent.  Another hint is the state, which shows all the other threads in a sleeping (Sl) state, while this thread is in a running (RI) state.
To get more information on this thread, we can look at the gstack output that was taken at the same time.  Searching for “LWP 4179†we can find the stack trace, which looked like this:
 Thread 74 (Thread 0x4170f940 (LWP 4179)):
#0  0x00002b13e1b7d18c in sendto () from /lib64/libpthread.so.0
#1  0x00002aaaaaae4d6c in NCPServer::SendBroadcastPing(int, char, unsigned int) () from /opt/novell/eDirectory/lib64/nds-modules/libncpengine.so
#2  0x00002aaaaaaedb77 in PingL1Oplock(int, int, unsigned int, CacheEntry*) ()
#3  0x00002aaaaaaf3ff7 in HandleLeaseSignal(int) ()
#4  0x00002aaaaab076fd in DircacheMaintenanceEvent() ()
#5  0x0000000000414561 in PoolWorker(void*) ()
#6  0x00002b13e1b76193 in start_thread () from /lib64/libpthread.so.0
#7  0x00002b13e1e9ff0d in clone () from /lib64/libc.so.6
 
In this case the thread was checking the status of an open file and the “sendto†function did not know how to interpret the result code it was getting back.  Since we can see the library in #1 was referencing libncpengine, we know this thread has something to do with filesystem ncp.   Turning on full ncp logging as described in #7 above, e.g. “ncpcon log everythingâ€, we saw the following in /var/opt/novell/log/ncpserv.log:
[! 2011-02-11 09:50:53] SendBroadcastPing: sendto() err:32, conn:555 errno:32, errmsg:Broken pipe
We could see this message scrolling over and over and was actually growing the ncpserv.log by about 100mb every 5 minutes!  Anyway, based on this message we could see the error was coming from connection 555.  Typing “ncpcon connection 555†provided more specific information about that connection, such as the workstation IP, the user who was logged in, the files that were open, etc…
This particular issue ended up being a bug in the code and is addressed in Bug #671487. 
This is just an example of how the ps and gstack can help one narrow in on the problem.  It is a good idea to capture at least four of five ps and gstack output taken in parallel every few seconds to have a base of comparison. 
Note:  The following script is provided as an example and is very basic. As is, the script writes ps, ndsd thread, gstack and ncp connection information to /var/log/ndsd_util.txt
It may also be a good idea to have cron execute this script every 5 minutes or something like that until the cause has been identified.  The following steps illustrate how to enable this script 
 
1. Copy the script below and name it ndsd_info.sh then place it in /root/bin directory. (ensure any text formatting is removed)
2. Type “crontab –e†as root
3.     Add “*/5 * * * *  /root/bin/ndsd_info.sh†to execute the command every 5 minutes and then save.
 
#### Start of Script
#!/bin/bash

. /opt/novell/eDirectory/bin/ndspath &> /dev/null

## Variables
LOGDIR=/var/log
VARDIR=$(ndsconfig get n4u.server.vardir | awk -F"vardir=" '{print $2}')
NDSD_PID=`cat $VARDIR/ndsd.pid`
LOG_OUTPUT=$LOGDIR/ndsd_util.txt
NDSD_PS_COMM="/bin/ps -C ndsd -L -o pid,tid,nlwp,pcpu,pmem,vsz,stat"
NDSD_THREADS_COMM="ndstrace -c threads"
NCP_THREADS_COMM="ncpcon threads"
GSTACK_COMM="/usr/bin/gstack $NDSD_PID"
NCPCONBIN=/sbin/ncpcon
NCP_CONN_SUMMARY="$NCPCONBIN connections"
NCP_CONN="$NCPCONBIN connections list"
HEADER="-------------------------------------------------------------------"
FOOTER="==================================================================="

addspace() {
echo "">> $LOG_OUTPUT
}

addheader() {
echo $HEADER >> $LOG_OUTPUT
addspace
}

addfooter() {
  addspace
  echo $FOOTER >> $LOG_OUTPUT
}

##Grab ps output from ndsd
ndsd_ps() {
  echo "BEGIN PS_OUTPUT at $(date)">> $LOG_OUTPUT
  addheader
  echo "Command: $NDSD_PS_COMM" 1>> $LOG_OUTPUT
  addspace
  $NDSD_PS_COMM >> $LOG_OUTPUT
  addspace
  echo "END PS_OUTPUT at $(date)">> $LOG_OUTPUT
  addfooter
}


##Grab output from ncpcon threads
ncp_threads() {
  echo "BEGIN NCP_THREADS at $(date)">> $LOG_OUTPUT
  addheader
  echo "Command: $NCP_THREADS_COMM" 2>/dev/null 1>> $LOG_OUTPUT
  $NCP_THREADS_COMM 2>/dev/null 1>> $LOG_OUTPUT
  addspace
  echo "END NCP_THREADS at $(date)">> $LOG_OUTPUT
  addfooter
}


##Grab output from ndstrace -c threads
ndsd_threads() {
  echo "BEGIN NDSD_THREADS at $(date)">> $LOG_OUTPUT
  addheader
  echo "Command: $NDSD_THREADS_COMM" 2>/dev/null 1>> $LOG_OUTPUT
  addspace
  $NDSD_THREADS_COMM | grep -v Instance 2>/dev/null 1>> $LOG_OUTPUT
  addspace
  echo "END NDSD_THREADS at $(date)">> $LOG_OUTPUT
  addfooter
}


##GSTACK
ndsd_gstack() {
  echo "BEGIN GSTACK at $(date)">> $LOG_OUTPUT
  addheader
  echo "Command: $GSTACK_COMM" 1>> $LOG_OUTPUT
  addspace
  $GSTACK_COMM 1>> $LOG_OUTPUT
  addspace
  echo "END GSTACK at $(date)">> $LOG_OUTPUT
  addfooter
}


##Grab NCP connection summary and connection information
ndsd_connections() {
  echo "BEGIN CONNECTIONS at $(date)">> $LOG_OUTPUT
  addheader
  echo "Command: $NCP_CONN_SUMMARY">> $LOG_OUTPUT
  addspace
  $NCP_CONN_SUMMARY 2>/dev/null 1>> $LOG_OUTPUT
  addspace
  echo "Command: $NCP_CONN" 1>> $LOG_OUTPUT
  $NCP_CONN 2>/dev/null 1>> $LOG_OUTPUT
  addspace
  echo "END CONNECTIONS at $(date)">> $LOG_OUTPUT
  addfooter
}

## Grab output from each defined function ensuring each one completes before going on

ndsd_ps && ndsd_gstack && ndsd_threads && ncp_threads && ndsd_connections

echo "############### Finished at $(date) ##############">> $LOG_OUTPUT


### End of Script