NDSD Segfault at 9e4 ip 00007ffd898e5bef sp 00007ffd8b575250 error 4 in libncpengine.so.0.0.0.

  • 7015747
  • 07-Oct-2014
  • 16-Oct-2014

Environment

Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 2
May 2014 Scheduled Maintenance patches

Situation

On a large multi-node OES 11 SP2 cluster, random nodes kept crashing with a segfault in NDSD (libncpengine.so.0.0.0) after one of the eDir replica servers had become unresponsive to requests from the network.

Because the server was rebooted to restore functionality, little is known about the exact state of this eDir master server at the time of the events. No processes had actually crashed, but connections on this server were observed to be stuck in the CLOSE_WAIT state.

Because of the continuous flood of crashes across different cluster nodes and the impact on users, it was decided to reboot the eDir server, after which the cluster nodes stopped crashing.
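
If the condition recurs, the number of sockets stuck in CLOSE_WAIT can be recorded before the server is rebooted. CLOSE_WAIT means the remote end has closed the connection while the local process has not yet closed its socket. The following is a minimal sketch, not part of the original investigation. It assumes a Linux host where /proc/net/tcp lists one socket per line with the state as the fourth field, encoded as hexadecimal 08 for CLOSE_WAIT. The function name countCloseWait is hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts sockets in CLOSE_WAIT in a stream laid out like /proc/net/tcp.
// Assumption: the first line is a header and the fourth whitespace-separated
// field of every other line is the TCP state in hex ("08" = CLOSE_WAIT).
std::size_t countCloseWait(std::istream& in) {
    std::size_t count = 0;
    std::string line;
    std::getline(in, line);  // skip the column header
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string slot, local, remote, state;
        if (fields >> slot >> local >> remote >> state && state == "08") {
            ++count;
        }
    }
    return count;
}
```

Passing a std::ifstream opened on /proc/net/tcp (or /proc/net/tcp6 for IPv6 sockets) gives the current count. A value that keeps growing over time would suggest a process that is not closing its sockets.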

Backtraces from several cores reveal a common code path:

#bt
#0  EnumConnectionInformation (currentConnection=26, infoRequestMask=<optimized out>, ListCount=1, connList=0xd88e07e,
    BufferSize=<optimized out>, infoItems=0xfea8008, buffer=0xfea800c "\034", bytesUsed=0x7ffd8b5754fc) at ../../engine/ncpserv/cmgrTable.cpp:564
#1  0x00007ffd89942765 in Case123 (connectionNumber=26, tid=<optimized out>, req=0x7ffd8b575a00, req_packet_len=<optimized out>)
    at ../../engine/ncpserv/ncpdStats.cpp:952
#2  0x00007ffd8993fb27 in ExecuteNCPPacket (connectionNumber=26, req=0x7ffd8b575a00, req_packet_length=15)
    at ../../engine/ncpserv/ncpdServer.cpp:147
#3  0x00007ffd8990fbc3 in INCP::HandleNCPFileServiceRequest (this=0x891120) at ../../engine/incp.cpp:3107
#4  0x00007ffd89911845 in INCP::Process (this=0x7ffd8b575ca0, forNCPFileServices=1, handler=0x0) at ../../engine/incp.cpp:3012
#5  0x00007ffd89911b3b in INCP::HandleNCPRequest (this=0x7ffd8b575ca0, receiveBuffer=0xd88e000, waitTillDoneFlag=0,
    returnedBufferFlag=0x7ffd8b575c5c) at ../../engine/incp.cpp:712
#6  0x00007ffd8991282b in INCP::ServiceStreamGroupConnections (this=0x7ffd8b575ca0, ssg=0x7ffd89bc0780) at ../../engine/incp.cpp:1393
#7  0x00007ffd89912eea in NCPPollerThread (ssg=0x7ffd89bc0780) at ../../engine/incp.cpp:252
#8  0x000000000041a072 in PoolWorker (data=0xe14a660) at /usr/src/debug/novell-NDSbase-8.8.8.2/nds-8.8.8.2/unix/dhost/ddstpool.cpp:402
#9  0x00007ffd8ca8e7f6 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffd8c05509d in capget () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
#

A second example:

#bt
#0  EnumConnectionInformation (currentConnection=29, infoRequestMask=<optimized out>, ListCount=1, connList=0xd53807e, BufferSize=<optimized out>,
    infoItems=0xf07b008, buffer=0xf07b00c "\a", bytesUsed=0x7ff9d83594fc) at ../../engine/ncpserv/cmgrTable.cpp:564
#1  0x00007ff9d7c3a715 in Case123 (connectionNumber=29, tid=<optimized out>, req=0x7ff9d8359a00, req_packet_len=<optimized out>)
    at ../../engine/ncpserv/ncpdStats.cpp:687
#2  0x00007ff9d7c37ad7 in ExecuteNCPPacket (connectionNumber=252162064, req=0x2, req_packet_length=0) at ../../engine/ncpserv/ncpdServer.cpp:138
#3  0x0000001d007ae000 in ?? ()
#4  0x0000000000002222 in ?? ()
#5  0x000000000d538068 in ?? ()
#6  0x00007ff9d8359a90 in ?? ()
#7  0x0000000000000000 in ?? ()
#


Resolution

Code was put in place to prevent the connection from becoming NULL.

Cause

The EnumConnectionInformation(currentConnection,..,*connList,..) function tries to enumerate each of the connections in the list, and eventually crashes because a connection number has become NULL where it previously was a valid connection number.
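
The failure mode described above can be illustrated with a simplified sketch. This is not the NCP engine code. The Connection and ConnectionTable types and the enumerateConnections function are hypothetical stand-ins, and the sketch ignores locking between threads. It shows a defensive check at the point of use, which is one common way to survive an entry that disappears during enumeration. The actual fix is described only as preventing the connection from becoming NULL, so it may work differently.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-connection record; not the engine's real structure.
struct Connection {
    std::uint32_t number;
    std::string peer;
};

// Hypothetical connection table. find() returns nullptr when the
// connection no longer exists, for example after it was torn down.
class ConnectionTable {
public:
    void add(std::uint32_t number, std::string peer) {
        table_[number] = Connection{number, std::move(peer)};
    }
    void remove(std::uint32_t number) { table_.erase(number); }
    const Connection* find(std::uint32_t number) const {
        auto it = table_.find(number);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::uint32_t, Connection> table_;
};

// Collects the peer of every connection in connList that still resolves.
// Without the nullptr check, dereferencing a vanished entry would fault.
std::vector<std::string> enumerateConnections(
    const ConnectionTable& table,
    const std::vector<std::uint32_t>& connList) {
    std::vector<std::string> peers;
    for (std::uint32_t number : connList) {
        const Connection* conn = table.find(number);
        if (conn == nullptr) {
            continue;  // entry vanished since the list was built; skip it
        }
        peers.push_back(conn->peer);
    }
    return peers;
}
```

In this sketch, a list that still names a removed connection simply yields fewer results instead of a crash.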

Additional Information

We cannot reliably reproduce the problem of unresponsive TCP connections that is believed to have led up to this crash. For that reason, and until problem duplication or a test case for the current code base allows us to confirm the solution that was put in place, the solution has been ported to the next release of OES (OES2015).