Troubleshooting slow logins and unresponsive DSfW server

  • 7010462
  • 16-Jul-2012
  • 28-Dec-2017

Environment

Open Enterprise Server 2018
Open Enterprise Server 2015 (OES 2015) Linux Support Pack 1
Novell Open Enterprise Server 11 Support Pack 2 (OES11SP2)
Novell Open Enterprise Server 11 Support Pack 1 (OES11SP1)
Novell Open Enterprise Server 2 Support Pack 3
Domain Services for Windows
DSfW

Situation

DSfW server - slow logins
DSfW server - Logins take more than a minute
DSfW server - unresponsive
DSfW server - Poor performance
DSfW server - Slow performance
DSfW server - ndsd and/or xadsd crash
/var/log/messages shows "winbindd: Exceeding 200 client connections, no idle connection found"
Users cannot authenticate

Resolution

When troubleshooting slow logins on a DSfW server, the most common cause is the kdc receiving too many invalid requests.  Utilization, memory, and gstacks for both ndsd and xadsd are also areas to check.

Kerberos
At this time the kdc (version 1.5) is single threaded.  When more requests arrive than kerberos can handle, or requests take longer to process, they are queued and held in memory until they can be processed.  If too many requests are queued, all available memory can be consumed, triggering the OOM killer, which kills the process consuming the most memory, usually ndsd.  Below are the most common reasons for the slow response.

Decrypt integrity check failed - bad password
    Look at the next line in the log for the user or workstation trying to log in with a bad password.  Workstation names end with a $, immediately before the @domain name.

Example:
AS_REQ 192.168.0.4 PREAUTH_FAILED: <workstation$@dsfw.lan> for krbtgt/dsfw.lan Preauthentication failed

A way to search, sort, and count computers and users with the Decrypt integrity check failed error is:

grep -A1 -i 'Decrypt integrity check failed' /var/opt/novell/xad/log/kdc.log |grep -v 'Decrypt integrity check failed' |awk -F ')' '{print $3}' |grep -v '^$' |awk -F 'for' '{print $1}' |sort -n | uniq -ci | sort -n | sed -e s/PREAUTH_FAILED:/BAD_PASSWORD:/g

The decrypt integrity check errors can cause slow logins, cause the domain controller to become unresponsive, or even crash the domain controller.  Implement intruder lockout on the container where the user or workstation objects reside.  Without intruder lockout on the container, it is common to see workstations with an invalid password attempt to log in every 3 to 5 seconds.  One workstation attempting to log in every 3 seconds can cause some slowness; 3 or more workstations can slow logins down by more than 10 minutes or eventually crash ndsd or xadsd.  TID 7006851 shows a process to enable intruder detection in a GPO and set up WINS to help reduce workstations joining the domain with a duplicate name.

By default a workstation changes its password every 30 days.  If the workstation changed its password but the computer object did not receive the set password request, the computer object continues with the old password.  When the workstation then attempts to log in, it fails because of the incorrect password.  Either rejoin the computer to the domain or reset the computer account following MS KB 216393 or MS Article 849751.

If the computer object is moved or installed into a different container, the container must have the cn=Default Password Policy,cn=Password Policies,cn=System,<domain_container>  password policy assigned.  This password policy is specifically designed to work with workstations and must be assigned to containers with computer objects.

Run the following commands to find all computer objects and check that a password policy is assigned to each container:
1) Set the DEFAULTNAMINGCONTEXT variable.
DEFAULTNAMINGCONTEXT=`/usr/bin/ldapsearch -x -b "" -s base DEFAULTNAMINGCONTEXT | grep -i 'DEFAULTNAMINGCONTEXT: ' | awk '{print $2}'`
2) Export the ldap.conf.
export LDAPCONF=/etc/opt/novell/xad/openldap/ldap.conf
3) Search for computer objects and list their containers.
/usr/bin/ldapsearch -Y EXTERNAL -b "$DEFAULTNAMINGCONTEXT" -s sub -LLL -Q '(&(objectclass=computer)(1.2.840.113556.1.4.221=*)(1.2.840.113556.1.4.782=*))' dn |sed -e :a -e '$!N;s/\n //;ta' -e 'P;D' | cut -d: -f2 | cut -d, -f2-12 | sort -u | grep -iv ^dc= |grep -v ^//
4) Search for password policy assignments.
/usr/bin/ldapsearch -Y EXTERNAL -Q -LLL -b "" -s sub '(&(nspmPasswordPolicyDN=*))' nspmPasswordPolicyDN| sed -e :a -e '$!N;s/\n //;ta' -e 'P;D'
5) Validate that every container returned in step 3 is reported in step 4 as having a password policy assigned.
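
With many containers, the comparison in step 5 is easier to script than to do by eye.  The following is a minimal sketch, assuming the output of step 3 was redirected to one file (one container DN per line) and the LDIF output of step 4 to another.  The function name and the file names are illustrative only and are not part of DSfW.

```shell
# containers_without_policy CONTAINERS_FILE POLICY_LDIF
# Prints every container DN from step 3 that has no matching "dn:" entry
# in the step 4 LDIF output. The comparison is case-insensitive and ignores
# surrounding whitespace; the result is printed in lower case.
containers_without_policy() (
    containers=$1
    policy_ldif=$2
    tmp=$(mktemp) || exit 1
    # DNs of objects that have nspmPasswordPolicyDN set (step 4 output)
    sed -n 's/^dn:[[:space:]]*//p' "$policy_ldif" \
        | sed 's/[[:space:]]*$//' \
        | tr '[:upper:]' '[:lower:]' | sort -u > "$tmp"
    # Containers holding computer objects (step 3 output), normalized the same way
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$containers" \
        | tr '[:upper:]' '[:lower:]' | sort -u \
        | comm -23 - "$tmp"
    rm -f "$tmp"
)

# Example (hypothetical file names):
# containers_without_policy /tmp/computerContainers.txt /tmp/policyAssignments.ldif
```

Any container printed by the function holds computer objects but did not appear in the policy assignment search, so it is a candidate for assigning the Default Password Policy.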


locked out - account has been locked out

A way to search, sort, and count computers and users with this error is:

grep -i 'locked out' /var/opt/novell/xad/log/kdc.log |cut -d ')' -f3 |awk -F 'for' '{print $1}' |sort -n | uniq -ci |sort -n

If this is for a workstation account, this error message usually means there is another workstation with the same name trying to log in.  The workstation with the duplicate name will attempt to log in several times with the wrong password, generating Decrypt integrity check errors and triggering the intruder lockout.
If this is a user, the user account is locked, usually due to intruder lockout.


Along with the kdc.log, use ldapsearch to return computer accounts that are currently locked out.  If thousands of computers are joined to a domain, the kdc.log may roll over to a new log before a substantial list of workstations builds up.  A workstation might be listed once or twice, or not at all, because the kdc.log was recently rolled over.  To counter that, parse multiple logs together or use ldapsearch to return a list of computer accounts that are currently locked out.

/usr/bin/ldapsearch -Y EXTERNAL -b "$DEFAULTNAMINGCONTEXT" -s sub -LLL -Q '(&(objectclass=computer)(lockedByIntruder=TRUE)(1.2.840.113556.1.4.221=*)(1.2.840.113556.1.4.782=*))' dn

To send the results to a file, add > /tmp/computersLocked.txt to the end of the ldapsearch command.
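
Parsing several kdc logs together can be done by feeding all of them through the same pipeline used for a single log.  The sketch below assumes the rotated logs are uncompressed and that their names are passed on the command line; the function name is illustrative.  It splits on ' for ' (with spaces) rather than 'for', which is a small change from the commands in this document so that names containing "for" are not cut short.

```shell
# count_kdc_error PATTERN LOGFILE...
# Counts, per source address and principal, the kdc.log lines matching PATTERN
# across all given log files. Uses the same field layout as the single-file
# commands in this document: third ')'-delimited field, text before " for ".
count_kdc_error() (
    pattern=$1
    shift
    cat -- "$@" \
        | grep -i -- "$pattern" \
        | cut -d ')' -f3 \
        | awk -F ' for ' '{print $1}' \
        | sort | uniq -ci | sort -n
)

# Example (the rotated file name is hypothetical; list whichever files exist):
# count_kdc_error 'locked out' /var/opt/novell/xad/log/kdc.log.1 /var/opt/novell/xad/log/kdc.log
```

The most frequent offenders appear at the bottom of the output.  Compressed rotated logs would need to be decompressed first.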

client not found - account is not found in domain

A way to search, sort, and count computers and users with this error is:

grep -i 'client not found' /var/opt/novell/xad/log/kdc.log |cut -d ')' -f3 |awk -F 'for' '{print $1}' |sort -n | uniq -ci |sort -n

Similar to a 601 in eDirectory.  Take an ndstrace with +time +tags +auth +ldap +vcln to see the search request.  Check that the account exists in the domain and is samified.  Client not found errors can cause slow logins.
A common cause for this error is an application that is attempting to log in as a user that no longer exists, or a computer that is attempting to log in after its computer object has been deleted.

If the machine at the IP address listed with the error cannot be found or disabled, enabling the firewall on the DSfW server and blocking the IP address is a good way to prevent a rogue computer from attempting to log in multiple times.
To fix the workstations, rejoin them to the domain.
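
When deciding which addresses to track down or block, a per-address count of the failing requests can help.  The sketch below takes the first IPv4 address on each matching log line; the function name is illustrative, and it assumes the client address is the first dotted-quad on the line, as in the example log line earlier in this document.

```shell
# kdc_error_sources PATTERN LOGFILE...
# Prints "count address" for the first IPv4 address found on each line that
# matches PATTERN, most frequent address first. IPv6 addresses are ignored.
kdc_error_sources() (
    pattern=$1
    shift
    cat -- "$@" \
        | grep -i -- "$pattern" \
        | awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) { print substr($0, RSTART, RLENGTH) }' \
        | sort | uniq -c | sort -rn
)

# Example:
# kdc_error_sources 'client not found' /var/opt/novell/xad/log/kdc.log
```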

Group Types
DSfW, like AD, has three types of groups: Domain Local, Global, and Universal.  The default group type is Universal.
Slow logins can be a result of group type.  Global and Universal groups calculate a virtual attribute called tokenGroupsDomainLocal.  This attribute is calculated for the group by the slapi layer.  When a user is a member of several groups, login times can increase.  An increase in ndsd utilization can also result from the calculation of tokenGroupsDomainLocal when a large number of groups reside within the domain.

If ndsd utilization is high or login times need to be reduced, and a group type of Universal or Global is not needed, change groups to Domain Local groups to avoid the calculation of the tokenGroupsDomainLocal virtual attribute.

MMC, iManager, or an ldif can be used to change the group type.  With iManager use the other tab to change the value for the group type.

The value for a Domain Local group type is -2147483644

See TID 7011498 for more information on Group Types causing utilization and 7004405 for more information on Group Types.

Utilization and Memory
Monitor the utilization and memory on the DSfW server.  Look for patterns and trends in the utilization on the server or if the server is running out of memory.
There are several ways to get just ndsd utilization.  The two main tools are top (in batch mode) and ps.

top
Show only ndsd process utilization
top -b -n1 -p `pidof ndsd`
Show only ndsd utilization and all ndsd threads - add -H
top -b -n1 -H -p `pidof ndsd`
Show utilization for all processes
top -b -n1

ps
ps -C ndsd -L -o pid,tid,nlwp,pcpu,pmem,vsz,stat

Memory
For overall server memory, both physical and swap, common tools to use are /proc/meminfo and free.  If the server is running out of memory, it is useful to track the server's memory over a period of time until the server has consumed all memory or is close to doing so.  The ps and top commands can also display memory.

meminfo
cat /proc/meminfo

free
Display memory in kilobytes or megabytes.
/usr/bin/free -k
/usr/bin/free -m
Use the watch command with free to monitor in a console
watch -n 10 -d free -k
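
The watch command only shows the current values.  To track memory over time, the same figures can be appended to a log with a timestamp.  This is a minimal sketch; the function name and the log path in the example are illustrative.

```shell
# mem_snapshot [MEMINFO_FILE]
# Prints one line: a timestamp followed by MemTotal, MemFree, SwapTotal and
# SwapFree in kB, read from /proc/meminfo unless another file is given.
mem_snapshot() (
    meminfo=${1:-/proc/meminfo}
    awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" '
        $1 == "MemTotal:"  { mt = $2 }
        $1 == "MemFree:"   { mf = $2 }
        $1 == "SwapTotal:" { st = $2 }
        $1 == "SwapFree:"  { sf = $2 }
        END { printf "%s MemTotal=%s MemFree=%s SwapTotal=%s SwapFree=%s\n", ts, mt, mf, st, sf }
    ' "$meminfo"
)

# Example: append a sample every 60 seconds until interrupted with Ctrl-C.
# while true; do mem_snapshot >> /tmp/meminfo-trend.log; sleep 60; done
```

A steadily falling MemFree together with a falling SwapFree in the resulting log suggests the server is heading toward the out-of-memory condition described in the Kerberos section.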

Overall stats
To gather overall ndsd or xadsd memory, pid info, number of threads, etc., cat the /proc/<pid of process>/status file.
cat /proc/`pidof ndsd`/status
cat /proc/`pidof xadsd`/status

eDirectory Threads
Check that ndsd is not running out of threads.  If ndsd runs out of server threads, the server can hang and ndsd's performance can degrade severely.

ndstrace
ndstrace -c threads

proc threads
cat /proc/`pidof ndsd`/status |grep -i Threads:

The thread setting can be seen with the command: ndsconfig get |grep n4u.server.max-threads
And set to, say, 512 with the command: ndsconfig set n4u.server.max-threads=512
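
To judge how close ndsd is to the limit, the Threads: value from /proc can be compared with the configured maximum.  The sketch below takes the maximum as an argument rather than parsing ndsconfig output; the function name is illustrative.  The Threads: figure counts every thread in the process, which may include threads not governed by n4u.server.max-threads, so treat the percentage as a rough indicator.

```shell
# thread_usage STATUS_FILE MAX_THREADS
# Reads the "Threads:" line from a /proc/<pid>/status file and prints the
# thread count, the given maximum, and the percentage in use.
thread_usage() (
    status_file=$1
    max=$2
    awk -v max="$max" '
        $1 == "Threads:" { n = $2 }
        END {
            if (n == "" || max <= 0) { print "no data"; exit 1 }
            printf "threads=%d max=%d used=%d%%\n", n, max, (n * 100) / max
        }
    ' "$status_file"
)

# Example, using the maximum shown by: ndsconfig get |grep n4u.server.max-threads
# thread_usage /proc/`pidof ndsd`/status 512
```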

eDirectory
eDirectory health and configuration is critical to DSfW.  Tuning eDirectory memory can help with DSfW performance.  Follow TID 3178089 for general guidelines.  TID 7002682 will help in instructing how to change eDirectory cache preallocation.

eDirectory memory
If eDirectory still consumes an abundant amount of memory after tuning eDirectory memory, look at TID 7002714.
Valgrind is a tool that can be used to troubleshoot memory leaks.  Memory leaks are rare, but it is sometimes necessary to find a potential memory leak in an environment.  TID 7005905 gives instructions on how to use Valgrind to troubleshoot eDirectory.


NCP Connections
Hopefully the DSfW server is not used for ncp authentications.  It still might be helpful to check that ncp threads are not exhausted and to know the number of valid connections and authenticated connections.  Tools to use are ncpcon, ndstrace, and the ncpserv.log.

ncpcon
/sbin/ncpcon threads 2>&1
/sbin/ncpcon stats 2>&1

ndstrace
/opt/novell/eDirectory/bin/ndstrace -c connections | grep -v Instance 2>&1
/opt/novell/eDirectory/bin/ndstrace -c connections | grep -v Instance |grep -E "VALID|AUTHEN" 2>&1

/var/opt/novell/log/ncpserv.log
tail -100 /var/opt/novell/log/ncpserv.log |egrep "DirCache| cached |evicted"

Samba Connections
By default winbind allows only 200 simultaneous connections.  It is possible that all available connections are in use.  If this is the case, installing an additional domain controller will help alleviate the simultaneous connections issue.  The smbstatus command can be used to view connections.

smbstatus
The command can be used with or without -v (verbose)
smbstatus -v
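
To see when the connection limit was reached, the winbindd message quoted in the Situation section can be counted per hour in the system log.  This sketch assumes the traditional syslog line layout that starts with month, day, and time; the function name is illustrative.

```shell
# winbind_limit_hits [LOGFILE...]
# Prints one line per hour ("count Mon DD HH:00") for log lines reporting that
# winbindd exceeded its client connection limit. Reads /var/log/messages when
# no file is given. Relies on the log being in chronological order.
winbind_limit_hits() (
    [ $# -gt 0 ] || set -- /var/log/messages
    cat -- "$@" \
        | grep 'winbindd: Exceeding [0-9]* client connections' \
        | awk '{ print $1, $2, substr($3, 1, 2) ":00" }' \
        | uniq -c
)

# Example:
# winbind_limit_hits /var/log/messages
```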

gstacks
Taking gstacks before and while the server is unresponsive can help show what a process is doing.  Taking a gstack every 30 seconds to a minute while the server is unresponsive will help narrow down the thread(s) that are hung and thus causing the server to be unresponsive.  The two main processes to take gstacks for are ndsd and xadsd.
gstack `pidof ndsd`
gstack `pidof xadsd`
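
Taking repeated gstacks by hand is error-prone while a server is hung, so the sampling can be wrapped in a small loop.  This is a sketch only; the function name, the STACK_CMD variable, and the output directory are illustrative.

```shell
# collect_stacks PID COUNT INTERVAL OUTDIR
# Runs the stack command against PID COUNT times, INTERVAL seconds apart,
# writing each sample to its own timestamped file in OUTDIR.
# STACK_CMD defaults to gstack; it is a variable only so the tool can be swapped.
collect_stacks() (
    pid=$1 count=$2 interval=$3 outdir=$4
    stack_cmd=${STACK_CMD:-gstack}
    mkdir -p "$outdir" || exit 1
    i=1
    while [ "$i" -le "$count" ]; do
        out="$outdir/stack.$pid.$(date '+%Y%m%d-%H%M%S').$i.txt"
        "$stack_cmd" "$pid" > "$out" 2>&1
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
)

# Example: 10 gstacks of ndsd, 30 seconds apart
# collect_stacks `pidof ndsd` 10 30 /tmp/gstacks-ndsd
```

Comparing consecutive files shows which threads stay in the same function from sample to sample.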

Linux User Management
(LUM or namcd)
Check that LUM is running and responding quickly.  LUM can cause high utilization or become unresponsive when the preferred server is down and no alternative ldap server is listed.  Adding the DSfW server as the preferred server is recommended, as well as having at least one alternative ldap server, usually the eDir server used for the install.  If there are other DSfW servers in the domain, list them as well as the eDir server as alternative ldap servers.  Use a comma as the delimiter.  Do not list NetWare servers as preferred or alternative servers.  The class and attribute mappings are different and will not return values.

Check the persistent-search setting.  By default this is set to no; if it is set to yes, it has been known to cause hangs.  See TID 7006086 on this issue.

It is possible to also disable Persistent Search on the ldap server object and not just for LUM.
Use iManager or ldapconfig to disable Persistent Search
To check the setting using ldapconfig do:
ldapconfig get ldapEnabledPSearch
At the prompt, enter the FQN for admin in .x500 format, for example: admin.novell
To set Persistent Search to no, do the following:
ldapconfig set "ldapEnabledPSearch=no"

After disabling Persistent Search on the ldap server restart the following services:
rcndsd
rcnamcd
rcnscd
rcowcimomd

The log level can be increased to give additional information.  Default is 0 and the max is 5.
To change the log level use namconfig
namconfig set log-level=5

The logs to check for troubleshooting LUM are:
/var/log/messages
/usr/lib/novell-lum/nam.log
/var/log/boot.msg
To check the log-file-location do the following: namconfig get |grep log-file 
By default it is not set and uses the /usr/lib/novell-lum/nam.log but this can be changed to a different location if desired.
If there is a setting like /var/log/ then the file will be /var/log/namcd.log

For more troubleshooting of LUM see TID 7002981.

Cores
If a process crashes, look for a core file.  If one or more DSfW services have crashed, most likely there is a core.  You may be asked to send the core to NTS.  Cores often help in identifying the condition the daemon was in and the function in which the process cored.

Core Location
For ndsd, the core file will be located with the dib, which by default is /var/opt/novell/eDirectory/data/dib/.  For some services the core will be located in the root directory /.

Here are a few services and their core locations:
ndsd   /var/opt/novell/eDirectory/data/dib/core
smbd /var/log/samba/core/smbd/ or /
nmbd /var/log/samba/core/nmbd/
winbindd /var/local/dumps/core.winbindd.#
rpcd /var/local/dumps/core.rpcd.#
xadsd usually in /
namcd usually in /

If a core is in /, the strings command usually helps to see which process is responsible for the core.
strings core.#### |grep DAEMON

MALLOC_CHECK_
For ndsd cores most likely the first core will not provide the information needed.  The google memory allocator or MALLOC needs to be disabled to prevent memory corruption.  

To make this setting, modify the pre_ndsd_start script, add the following lines to the top of the script, and then restart ndsd.
MALLOC_CHECK_=3
export MALLOC_CHECK_

NOTE: eDirectory on SLES 12 or RHEL 7:  You must add all environment variables required for the eDirectory service in the env file located in the /etc/opt/novell/eDirectory/conf directory.

To check the malloc setting on a server do
strings /proc/`pidof ndsd`/environ |grep -i MALLOC_CHECK_

To check the malloc setting on a core file do
strings core.#### |grep -i MALLOC_CHECK_
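
For a running process, /proc/<pid>/environ is a list of NAME=value entries separated by NUL bytes, so it can also be split exactly instead of scanned with strings.  The sketch below prints the value of one variable; the function name is illustrative.  Reading another process's environ normally requires root.

```shell
# proc_env_value ENVIRON_FILE NAME
# Prints the value of NAME from a NUL-separated environment file such as
# /proc/<pid>/environ. Prints nothing and returns 1 if NAME is not set.
proc_env_value() (
    environ_file=$1
    name=$2
    tr '\0' '\n' < "$environ_file" | awk -v n="$name" '
        index($0, n "=") == 1 { print substr($0, length(n) + 2); found = 1; exit }
        END { exit(found ? 0 : 1) }
    '
)

# Example:
# proc_env_value /proc/`pidof ndsd`/environ MALLOC_CHECK_
```

If the function prints nothing, the variable was not in the environment ndsd was started with, which usually means ndsd has not been restarted since the script was changed or the setting was made in the wrong place.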

ulimit
The ulimit for the process might need to be adjusted if no core file is created.  There are a number of ways to make this setting.

As the root user enter in the terminal
ulimit -c unlimited

Or set the ulimit to unlimited in the /etc/init.d/<process script>: on the second line, directly under #!/bin/bash, enter
ulimit -c unlimited

To view all limits
ulimit -a

To view the core file size limit for the logged-in user
ulimit -c

To disable the limit on the maximum size of a core dump file globally, edit /etc/security/limits.conf.
Uncomment the line
#*               soft    core            0
and change it to 
*               soft    core            unlimited

The /etc/profile might be preventing a core as well.  It is common to have a ulimit value set to '0' or to see 'ulimit -Sc 0' in /etc/profile.
If this is preventing a core, change the setting to 'ulimit -Sc unlimited'
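
The ulimit command reports the limit of the current shell, which is not necessarily the limit the daemon was started with.  Where the kernel provides /proc/<pid>/limits (an assumption; older kernels may not have this file), the limit in effect for the running process can be read directly.  The function name is illustrative.

```shell
# core_limit LIMITS_FILE
# Prints the soft and hard "Max core file size" values from a
# /proc/<pid>/limits file, for example "soft=0 hard=unlimited".
core_limit() (
    awk '/^Max core file size/ { printf "soft=%s hard=%s\n", $5, $6 }' "$1"
)

# Example:
# core_limit /proc/`pidof ndsd`/limits
```

A soft value of 0 means the process will not write a core file until it is restarted with a higher limit.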

novell-getcore
To gather a core and necessary libraries so that NTS can read the core the novell-getcore script can be used.  It is installed on OES servers by default.  GDB should also be installed on the server.  If not please install gdb before using the novell-getcore script.

To use the novell-getcore script so that NTS can analyze the core, use the following syntax:
novell-getcore -b /<path to core>/<core file> /<path to binary>/<binary for the process>

Use the command strings core.#### |grep -i DAEMON if unsure which process caused the core.

eDir 8.7.3 example:
novell-getcore -b /var/nds/dib/core.#### /usr/sbin/ndsd

eDir 8.8 and later example:
novell-getcore -b /var/opt/novell/eDirectory/data/dib/core.#### /opt/novell/eDirectory/sbin/ndsd

Samba example:
novell-getcore -b /core.#### /usr/sbin/smbd

Winbindd example:
novell-getcore -b core.#### /usr/sbin/winbindd

For more on ulimits, MALLOC_CHECK_ setting, and novell-getcore see TID 3078409

gdb
gdb can be used while the process is running.  Be sure to find which daemon caused the core first if it is located at the root of the filesystem.
strings core.#### |grep DAEMON

eDir (ndsd) example:
gdb `which ndsd` `pidof ndsd`

Samba example:
Because there is usually more than one smbd process, pidof returns several pids and cannot be used inline.  List the pids with ps -eaf |grep smbd or pidof smbd and select one, usually the second pid.
gdb `which smbd` 30350

Winbindd example:
As with smbd, pidof cannot be used inline for winbindd, so get the pid first and then use gdb.
ps -eaf |grep winbindd or pidof winbindd
gdb `which winbindd` 34560

Force a Core
If the server is hung but does not core it might be necessary to force a core.  To do so check the ulimit for the process as mentioned above.
Usually setting the ulimit at the terminal works 
ulimit -c unlimited
If that does not work, set the ulimit to unlimited in the /etc/init.d/<process script> script on the second line, directly under #!/bin/bash.

To force the core, attach gdb to the specific process and type gcore at the (gdb) prompt, or run the gcore command with the pid of the process.

To use gdb for a process the syntax is:
gdb <path to binary> <pid>
or
gcore <pid>

eDir (ndsd) forced core examples:
Attach gdb to ndsd with one of the following, then type gcore at the (gdb) prompt:
gdb /opt/novell/eDirectory/sbin/ndsd `pidof ndsd`
gdb `which ndsd` `pidof ndsd`
gdb /opt/novell/eDirectory/sbin/ndsd 23450
Here 23450 is the pid, which can be found using ps -eaf |grep ndsd
Or run gcore directly against the pid:
gcore 23450

Samba forced core examples:
First get the pid using ps -eaf |grep smbd, attach gdb, then type gcore at the (gdb) prompt:
gdb `which smbd` 30350
or, if you know the path and pid:
gdb /usr/sbin/smbd 30350
Or run gcore directly against the pid:
gcore 30350

Winbindd forced core:
First get the pid using ps -eaf |grep winbindd, attach gdb, then type gcore at the (gdb) prompt:
gdb `which winbindd` 34560
Or run gcore directly against the pid:
gcore 34560

Another option to trigger a core is to send SIGABRT to the process
example:
kill -ABRT `pidof ndsd`
or, if this is not working, enter the pid number, which can be discovered using ps -eaf |grep <process name>
ps -eaf |grep ndsd
kill -ABRT 2334

Log gdb output
If novell-getcore is not properly bundling up the core, or you would like to have a log file created while looking at the core with gdb, tee the output and then enter the commands at the gdb prompt.  The output of those commands will be sent to the specified log file.  Type quit to exit gdb, then look at the log file for the output of each command.

eDir (ndsd) example:
gdb /opt/novell/eDirectory/sbin/ndsd -c /var/opt/novell/eDirectory/data/dib/core.#### | tee /root/ndsdcore.log

Samba example:
gdb /usr/sbin/smbd -c /core.#### | tee /root/sambacore.log

Winbindd example:
gdb /usr/sbin/winbindd -c /var/log/samba/cores/winbindd/core.#### | tee /root/winbindcore.log

Useful outputs to gather from the core while in gdb are:
(gdb) bt - shows the stack
(gdb) info threads - shows thread summary
(gdb) thread apply all where - shows details on threads
(gdb) info sharedlibrary - show shared library
(gdb) info all-registers - shows all registers
(gdb) info args - show arguments
(gdb) info frame - information on frame
(gdb) info reg - information on registers in the current frame
(gdb) disass - disassemble the code around the current frame
(gdb) info locals - lists local variables in current stack frame
(gdb) fr 1 - go to frame 1; frames start at frame 0.  The number of frames can be seen in the stack (bt command)
(gdb) quit
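
The commands listed above can also be collected without an interactive session.  This sketch assumes the installed gdb supports the -batch and -ex options; the function name and the GDB variable are illustrative.  Each command is passed with its own -ex so that its output lands in the log in the order listed; whether a failing command (for example info args on a binary without debug symbols) stops the remaining ones may depend on the gdb version, so check the log for completeness.

```shell
# gdb_core_report BINARY CORE LOGFILE
# Runs gdb non-interactively against BINARY and CORE, executes the commands
# listed above in order, and writes all output to LOGFILE.
# GDB defaults to gdb; it is a variable only so the binary can be swapped.
gdb_core_report() (
    binary=$1 core=$2 log=$3
    gdb_cmd=${GDB:-gdb}
    "$gdb_cmd" -batch \
        -ex 'bt' \
        -ex 'info threads' \
        -ex 'thread apply all where' \
        -ex 'info sharedlibrary' \
        -ex 'info all-registers' \
        -ex 'info args' \
        -ex 'info frame' \
        -ex 'info reg' \
        -ex 'disass' \
        -ex 'info locals' \
        "$binary" "$core" > "$log" 2>&1
)

# Example (core.#### stands for the actual core file name):
# gdb_core_report /opt/novell/eDirectory/sbin/ndsd /var/opt/novell/eDirectory/data/dib/core.#### /root/ndsdcore.log
```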