Degraded backup performance, segfaults in AFP and Adminusd, and problems managing user quotas since applying the March 2012 Scheduled Maintenance update

  • 7005336
  • 26-Apr-2012
  • 16-May-2013

Environment

Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 3
Novell Open Enterprise Server 11 (OES 11) Linux
Novell Cluster Services 1.8.4
Novell AFP
3rd party Backup applications

Situation

The novell-ncpns library as released with the August 2011 Scheduled Maintenance update for OES2 SP3, is novell-ncpns-5.3-0.12.
The novell-ncpns library as released with the March 2012 Scheduled Maintenance update for OES2 SP3, is novell-ncpns-5.3-0.16.

Background  :
Backup applications that are not SMS compliant cannot, by default, back up file system meta-data (such as trustees and extended attributes) from NSS volumes. They require the xattr functionality to do so.

To back up NSS meta-data using xattr, the following two lines need to be added to '/etc/opt/novell/nss/nssstart.cfg' :
/ListXattrNWMetadata
/CtimeIsMetadataModTime
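
The check for these two lines can be scripted. The following is a minimal sketch only, not a Novell-supplied tool: the function name `ensure_switches` is hypothetical, and it assumes each switch sits on its own line and is matched exactly as written above.

```python
# The two switches named in this document.
REQUIRED_SWITCHES = ("/ListXattrNWMetadata", "/CtimeIsMetadataModTime")


def ensure_switches(cfg_text, required=REQUIRED_SWITCHES):
    """Return cfg_text with any missing switch appended on its own line."""
    # Compare against whole lines, ignoring surrounding whitespace only.
    present = {line.strip() for line in cfg_text.splitlines()}
    missing = [switch for switch in required if switch not in present]
    if not missing:
        return cfg_text  # nothing to add, leave the content untouched
    # Make sure existing content ends with a newline before appending.
    if cfg_text and not cfg_text.endswith("\n"):
        cfg_text += "\n"
    return cfg_text + "\n".join(missing) + "\n"
```

To apply it, read the contents of '/etc/opt/novell/nss/nssstart.cfg', pass them through `ensure_switches`, and write the result back after taking a copy of the original file. Running it a second time leaves the file unchanged.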


The initial problem :
A problem was reported in which non SMS compliant 3rd party backup applications, using the xattr functionality as described above, restored previously backed up data, and the rights on the files as they had been backed up were completely missing after the restore.


The particular problem exhibited itself as follows :
If a server needed to restore a right (ACL) for a user, but eDir on this server did not know about this particular user (for example, because the server neither held an eDir replica nor contained an external reference), then the restore of the rights was simply "skipped" (as in 'lost').

The restore of rights, for data that was backed up using xattr, would only succeed if the server being restored to already held an eDir replica (or an external reference) for the object whose rights were being restored.

This problem would be even worse if a complete site server was lost and, in order to get rapid access to the data again, the customer decided to restore the volume to another site server. This other site server would be unlikely to hold ANY external reference information about users of the crashed site, and hence no rights were restored at all.

An SMS backup using xattr functionality works by fetching the required IDs (owner, archiver, modifier and meta-data modifier IDs) from NSS, and passing these details on to NCP to obtain the corresponding DN.

The problem mentioned above was duplicated in-house, root cause was identified, and the solution to the above described problem was tested and confirmed during in-house testing prior to public release.

With the release of the August 2011 Scheduled Maintenance update for OES2 SP3, Novell released novell-ncpns-5.3-0.12 as the solution for the scenario described above, and made this available to the public.



The problems introduced with this novell-ncpns-5.3-0.12  version  :

It became clear that the provided solution was incomplete, as a number of customers using SMS compliant 3rd party backup applications suddenly encountered an enormous drop in backup performance. The same backup took up to 4x as long as it did before the problems were seen.

At the same time, a significant number of customers reported crashes (segfaults) in different areas, such as the Novell AFP stack or Adminusd (for example while managing user quotas), or ran into CIMOM warnings when using iManager to manage user quotas.


Resolution

As of the 19th of June 2012, customers no longer need to contact Novell Technical Support and request an FTF, as Novell has released a solution to the bugs listed below, for both OES2 and OES11, to the respective Update catalogs for customers to apply to their environments.

All of the (internally and externally) reported bugs listed below had the same root cause, and all of them are resolved with the published solution :
  • Bug 612452 - Restore of NSS metadata fails via xattr when there are no local replica on the server
  • Bug 745847 - AFP core for MapGUIDToID function call
  • Bug 747903 - NSSMU throws error 23388 during volume creation
  • Bug 748802 - viewing user quotas from iManager give CIMOM error after deleting the user having quota
  • Bug 749428 - Performance degradation after OES2 SP2 to SP3 upgrade
  • Bug 755875 - zERR_USER_ABORTED when trying to update eDirectory for pool object
  • Bug 756027 - OES2SP3 - segfault in adminusd

In addition to the regular OES patch channels, the patches have also been made available for separate download on the Novell download site :


Cause

Backup Performance degradation :
The enormous drop in backup performance is caused by a change in the GUID to DN behavior in the novell-ncpns-5.3-0.12 release, in terms of how the DN is obtained. Previously, when a DN was required, a search for the DN was done within the local eDir replica only. With the release of novell-ncpns-5.3-0.12, this was changed to perform a tree-wide search to obtain this DN.

Performing a tree-wide search in Enterprise production environments may not be considered a very good idea, and is a potential problem for several reasons. In this particular scenario it was very bad, as the search now ran into 'stale' GUIDs.
'Stale' GUIDs can exist, for example, when a user object has been deleted from eDir but the deleted user still remains as trustee (or owner, deleter, archiver, etc.) on the file system.

Running into stale GUIDs caused SMS to perform a tree-wide search in an attempt to convert every encountered (stale) GUID to a DN, and this tree-wide search for every 'stale' GUID is the root cause of the reported massive backup performance problems.
(see [1] in the additional information section below for symptoms)
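
The effect of this change can be illustrated with a toy model. This is only a sketch under stated assumptions and does not represent the actual novell-ncpns code: the `Directory` class and all function names are hypothetical, and it assumes the changed behavior tries the local replica first and then falls back to a tree-wide search.

```python
class Directory:
    """Toy stand-in for eDir lookups; counts tree-wide searches."""

    def __init__(self, local, remote):
        self.local = local        # GUID -> DN known to the local replica
        self.remote = remote      # GUID -> DN known elsewhere in the tree
        self.tree_searches = 0

    def lookup_local(self, guid):
        return self.local.get(guid)

    def lookup_tree_wide(self, guid):
        self.tree_searches += 1   # in a real tree this is the expensive step
        return self.remote.get(guid)


def map_guid_local_only(directory, guid):
    """Earlier behavior as described: consult the local replica only."""
    return directory.lookup_local(guid)


def map_guid_with_tree_search(directory, guid):
    """Changed behavior (assumed order): local first, then tree-wide."""
    dn = directory.lookup_local(guid)
    if dn is None:
        dn = directory.lookup_tree_wide(guid)
    return dn


def backup_scan(directory, file_guids, mapper):
    """Map the GUID of every file; return how many stayed unresolved."""
    return sum(1 for guid in file_guids if mapper(directory, guid) is None)
```

In this model a stale GUID never resolves, so every file that carries one triggers a full tree-wide search when `map_guid_with_tree_search` is used. The `tree_searches` counter then grows with the number of affected files, whereas with `map_guid_local_only` it stays at zero.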

The backup performance problems were not the only issues encountered. Additional symptoms, such as segfaults, were reported as a result of the Map GUID To DN problems caused by the tree-wide search for a DN.


Segfaults :
The encountered crashes manifested as outlined below :
  • segfault in AFP
(see [2] in the additional information section below for symptoms)

The reason for these crashes is that AFP internally also uses the NCP 'MapGUIDToDN' function, which maps a GUID to a DN.

Previously, passing an invalid GUID returned an error. After the code change, the same scenario always returned NULL, whereas an error should also be returned when the DN length is zero.

  • segfault in Adminusd when managing user quotas
(see [3] in the additional information section below for symptoms)

The reason for the crashes in Adminusd when managing user quotas is the same as above. When setting or changing quotas, a "MapGUIDToDN" is also performed.

Previously, passing an invalid GUID returned an error. After the code change, the same scenario always returned NULL, whereas an error should also be returned when the DN length is zero.

  • segfault in Adminusd when adding a volume from NSSMU, for which the Pool object does not yet exist in eDirectory
(see [4] in the additional information section below)

iManager :
When using iManager to manage user quotas, a warning message may be displayed.
(see [5]  in the additional information section below)

Additional Information

[1]  GUID To DN errors observed in various log files during backup

[! 2012-03-11 22:02:14] MapGUIDToRemoteDN: could not Map GUID To DN remote search failed
[! 2012-03-11 22:02:14] MapGUIDToRemoteDN: could not Map GUID To DN remote search failed
[! 2012-03-11 22:02:14] MapGUIDToRemoteDN: could not Map GUID To DN remote search failed
[! 2012-03-11 22:02:14] MapGUIDToRemoteDN: could not Map GUID To DN remote search failed
[! 2012-03-11 22:02:14] MapGUIDToRemoteDN: could not Map GUID To DN remote search failed

and also :
[! 2012-04-04 07:58:56] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0
[! 2012-04-04 07:58:56] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0
[! 2012-04-04 07:58:56] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0
[! 2012-04-04 07:58:56] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0
[! 2012-04-04 07:58:56] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0
[! 2012-04-04 07:58:57] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0
[! 2012-04-04 07:58:57] MapGUIDToRemoteDN: The GUID = 00000003-0000-0000-0000-000000000000  has not been found in the remote replica rc = 0

On OES11, we have received a report that restoring data which was initially backed up on an OES2 SP3 server using a 3rd party non-SMS compliant backup application (using xattr) resulted in the following message flooding the /var/log/messages file :

May  4 00:00:19 lnxsrv01 kernel: [1255876.182934] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.188086] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.189522] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.189529] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.189568] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.191780] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.192477] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
May  4 00:00:19 lnxsrv01 kernel: [1255876.196915] OID_AddEntryIfNotThere - Unable to map GUID to DN using NCP eDir
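
To gauge how often these messages occur and which GUIDs are involved, the log lines can be tallied. The following is a minimal sketch that matches only the message texts shown above; the function name `summarize_guid_errors` and the counter keys are hypothetical, and the caller supplies the lines from whichever log file contains them.

```python
import re
from collections import Counter

# Patterns taken from the message texts quoted in this section.
GUID_NOT_FOUND = re.compile(
    r"MapGUIDToRemoteDN: The GUID = ([0-9A-Fa-f-]{36})\s+has not been found"
)
REMOTE_SEARCH_FAILED = "MapGUIDToRemoteDN: could not Map GUID To DN"
KERNEL_MAP_FAILED = "OID_AddEntryIfNotThere - Unable to map GUID to DN"


def summarize_guid_errors(lines):
    """Count GUID-to-DN failures and tally the GUIDs named in the lines."""
    totals = Counter()   # number of messages per message type
    guids = Counter()    # number of messages per reported GUID
    for line in lines:
        match = GUID_NOT_FOUND.search(line)
        if match:
            totals["guid_not_found"] += 1
            guids[match.group(1)] += 1
        elif REMOTE_SEARCH_FAILED in line:
            totals["remote_search_failed"] += 1
        elif KERNEL_MAP_FAILED in line:
            totals["kernel_map_failed"] += 1
    return totals, guids
```

For example, `totals, guids = summarize_guid_errors(open("/var/log/messages"))` covers the kernel messages. `guids.most_common()` then lists the GUIDs that were reported most often, which may help to identify stale trustees or owners on the file system.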


[2] Segfaults in AFP
The crash observed is shown in /var/log/messages as below :

afptcpd[26145]: segfault at 0 ip 000000000042323d sp 00007f84b88c6330 error 4 in afptcpd[400000+61000]


[3] Segfaults in Adminusd
The crash observed is shown in /var/log/messages as below :

adminusd[16160]: segfault at 00007fff51057000 rip 00002b89df37e7b1 rsp 00007fff51053080 error 4


[4] Segfault in Adminusd
The crash observed is shown in /var/log/messages as below :

adminusd[16274]: segfault at 00007fff3a14a000 rip 00002ae16d8e891f rsp 00007fff3a147620 error 6


With the 'nsscon /vfs' switch enabled, the following is logged as well :
Apr  5 09:54:10 srv kernel: VFS: Write - num bytes=89  offset=0  pid=14299
Apr  5 09:54:10 srv kernel: WRITE DATA (pid=14299)=<nssRequest><pool><getNDSName><poolName>DATA3</poolName></getNDSName></pool></nssRequest>
Apr  5 09:54:10 srv kernel: adminusd[16274]: segfault at 00007fff3a14a000 rip 00002ae16d8e891f rsp 00007fff3a147620 error 6
Apr  5 09:54:10 srv adminus daemon: Adminusd proc: zero length read.
Apr  5 09:54:10 srv adminus daemon: adminusd: Handling function aborted
Apr  5 09:54:10 srv kernel: VFS: Read - num bytes=4096  offset=0  pid=14299
Apr  5 09:54:10 srv kernel: READ DATA (pid=14299)=<writeResult>
Apr  5 09:54:10 srv kernel: <error>zERR_USER_ABORTED(20017)</error>
Apr  5 09:54:10 srv kernel: </writeResult>

Another symptom of adminusd malfunctioning prior to a crash is the following errors in the /var/log/messages file :
Apr  4 12:40:33 srv adminus daemon: Adminusd proc: zero length read.
Apr  4 12:40:33 srv adminus daemon: adminusd: Handling function aborted
Apr  4 12:41:43 srv adminus daemon: Adminusd proc: zero length read.
Apr  4 12:41:43 srv adminus daemon: adminusd: Handling function aborted
Apr  4 12:41:46 srv adminus daemon: Adminusd proc: zero length read.
Apr  4 12:41:46 srv adminus daemon: adminusd: Handling function aborted
Apr  4 12:42:24 srv adminus daemon: Adminusd proc: zero length read.
Apr  4 12:42:24 srv adminus daemon: adminusd: Handling function aborted
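
The segfault lines shown in [2], [3] and [4] can be picked out of /var/log/messages and decoded with a short script. This is a minimal sketch: the function name `parse_segfault` is hypothetical, the pattern covers only the two line formats quoted above, and the "error" value is interpreted as the standard x86 page-fault error code (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# Matches both the "ip/sp" and the "rip/rsp" variants quoted above.
SEGFAULT = re.compile(
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"r?ip (?P<ip>[0-9a-f]+) r?sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
)


def parse_segfault(line):
    """Return the decoded fields of a segfault line, or None if no match."""
    match = SEGFAULT.search(line)
    if match is None:
        return None
    err = int(match.group("err"))
    return {
        "process": match.group("proc"),
        "pid": int(match.group("pid")),
        "address": int(match.group("addr"), 16),      # faulting address
        "access": "write" if err & 2 else "read",     # bit 1 of error code
        "user_mode": bool(err & 4),                   # bit 2 of error code
        "page_present": bool(err & 1),                # bit 0 of error code
    }
```

Applied to the afptcpd line in [2], this yields a user-mode read at address 0 on a page that is not present, which would be consistent with dereferencing a NULL pointer.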




[5] Warning message found in iManager :
The message as observed in iManager when managing user quotas is shown below :

Could not get volume user restriction information on VOL1. CIMOM error occurred: cannot write to the given file.


Last but not least, inspection of the code has confirmed that OES11 is vulnerable to all of the issues described above.

Please see the Resolution section for the date from which the solution has been available.