Identity Server stops responding to requests when one of it's SSL enabled user stores is down

  • 7012562
  • 06-Jun-2013
  • 06-Jun-2013

Environment

NetIQ Access Manager 3.2
NetIQ Access Manager 3.2 Support Pack 1 applied
NetIQ Access Manager 3.2 Identity Server

Situation

Access Manager setup and working well across two different data centers - 4 Identity Servers (IDP) in both data centers but in the same IDP cluster; 4 AGs in both data centers and in the same AG cluster. Each IDP server points to 4 different LDAP servers, all communicating with the IDP server over SSL.

On two occasions, the IDP servers in one site only stopped handling requests from any users - all 4 on one occasion, and 3 of the 4 on another. The Admin Console would report the IDP servers as 'not reporting', and if the admin tried to hit the liberty or saml metadatua url it would just hang without a response. No users could authenticate to the IDP servers in this state.

Resolution

Fixed in 3.2 Sp2.

Cause

The issue is tied to tcp connection issues on a highly loaded userstore. This issue occurs when IDP tries to initialize connection pool with userstore. During this time, socket connections are established but, due to no timely response from a specific userstore, the IDP hangs waiting indefinitely for the response.

The logs do show the issue, and dumping the java threads with the 'kill -QUIT' option confirms what threads are lingering indefinitely. The ssl socket time out is not working in this failure case: we can see that a few thread are blocked on connection state, a few threads got blocked in close state and few more are blocked in creating the connection.

Had all the ssl socket connections timed out then the thread would have been unblocked. but it doesn't appear this happened. We can see that the user stores are in restart mode from the catalina log files:

 <amLogEntry> 2012-11-26T18:44:49Z DEBUG NIDS Application: Method: NIDPServletContext.healthCheckUserStores Thread: http-bio-/10.201.3.156-8443-exec-24 For User Store #5: some replicas are in restart: ukbed.acorp.int </amLogEntry> <amLogEntry> 2012-11-26T18:44:49Z DEBUG NIDS Application: Method: NIDPServletContext.healthCheckUserStores Thread: http-bio-/10.201.3.156-8443-exec-24 For User Store #10: all replicas are in restart! </amLogEntry>

Few threads are in connection state with ssl connection:

"Thread-166811" daemon prio=10 tid=0x0000000040e69000 nid=0x42cb runnable [0x00007f86b4965000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293) at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:331) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:830) - locked <0x00000000f2a8c530> (a java.lang.Object) at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:787) at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75) - locked <0x00000000f2a8c280> (a com.sun.net.ssl.internal.ssl.AppInputStream) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) - locked <0x00000000f2a8c258> (a java.io.BufferedInputStream) at com.sun.jndi.ldap.Connection.run(Connection.java:820) at java.lang.Thread.run(Thread.java:662)

Health checks are blocked on pool open or not check

"RMI TCP Connection(27915)-127.0.0.1" daemon prio=10 tid=0x000000004038c000 nid=0x31be waiting for monitor entry [0x00007f86b4e62000] java.lang.Thread.State: BLOCKED (on object monitor) at com.novell.nam.common.ldap.jndi.JNDIUserStoreReplicaConnectionPool.isOpen(JNDIUserStoreReplicaConnectionPool.java:1196) - waiting to lock <0x00000000c16ecfe8> (a com.novell.nam.common.ldap.jndi.JNDIUserStoreReplicaConnectionPool) at com.novell.nam.common.ldap.jndi.JNDIUserStoreReplica.getReplicaLoad(JNDIUserStoreReplica.java:283) at com.novell.nam.common.ldap.jndi.JNDIUserStore.getLoad(JNDIUserStore.java:654) at com.novell.nam.common.ldap.jndi.JNDIUserStoreManager.getLoad(JNDIUserStoreManager.java:298) at com.novell.nam.common.ldap.jndi.JNDIUserStoreManager.getAverageLoad(JNDIUserStoreManager.java:306) at com.novell.nidp.servlets.NIDPServletContext.healthCheckThrottle(y:2480) at com.novell.nidp.servlets.NIDPServletContext.getHealth(y:757) at com.novell.jcc.client.JCCClientImpl$JCCRMIClient.getHealth(y:3499) at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Few threads are trying to make connection But blocked at connection pool is open or not check "http-bio-/10.201.3.156-8443-exec-11" daemon prio=10 tid=0x0000000040165800 nid=0x5619 waiting for monitor entry [0x00007f86b4dde000] java.lang.Thread.State: BLOCKED (on object monitor) at com.novell.nam.common.ldap.jndi.JNDIUserStoreReplicaConnectionPool.isOpen(JNDIUserStoreReplicaConnectionPool.java:1196) - waiting to lock <0x00000000c16ecfe8> (a com.novell.nam.common.ldap.jndi.JNDIUserStoreReplicaConnectionPool) at com.novell.nam.common.ldap.jndi.JNDIUserStoreReplica.getReplicaLoad(JNDIUserStoreReplica.java:283) at com.novell.nam.common.ldap.jndi.JNDIUserStore.getLoad(JNDIUserStore.java:654) at com.novell.nam.common.ldap.jndi.JNDIUserStoreManager.getLoad(JNDIUserStoreManager.java:298) at com.novell.nam.common.ldap.jndi.JNDIUserStoreManager.getAverageLoad(JNDIUserStoreManager.java:306) at com.novell.nidp.servlets.handler.BaseHandler.<init>(y:795) at com.novell.nidp.servlets.handler.ProtocolHandler.<init>(y:3561) at com.novell.nidp.liberty.IDFFHandler.<init>(y:2819) at com.novell.nidp.liberty.IDFFHandler.getInstance(y:2360) at com.novell.nidp.liberty.LibertyMeDescriptor.handleRequest(y:2859) at com.novell.nidp.servlets.NIDPServlet.myDoGet(y:1124) at com.novell.nidp.servlets.NIDPBaseServlet.doGet(y:1599) at javax.servlet.http.HttpServlet.service(HttpServlet.java:621) at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)