Potential secondary storage data loss with Sentinel 7.1

  • Document ID: 7014515
  • Creation Date: 06-Feb-2014
  • Modified Date: 17-Sep-2014

Environment

Sentinel 7.1.x
Sentinel 7.x

Situation

Engineering has determined that some versions of Sentinel are susceptible to an issue in recent versions of Oracle Java that could cause significant data loss.
Not all versions or implementations of Sentinel are affected.

If you:
    Are running one of these versions of Sentinel in production: 7.1.0.0, 7.1.0.1, 7.1.1.0, or 7.1.1.1
    Have secondary (formerly "network") storage configured
    Are processing close to 500 events per second (EPS) in any single parsed event partition
    Are seeing #PLACEHOLDER# messages in CSV exports or in server0.0.log
then read on to determine how to evaluate and fix your system.

Resolution

The following installer versions contain the fix:

Clean
-  7.1.1.2
Upgrade
-  7.1.1.2 (Use this for standard installations and all appliances.)
   (The upgrade for standard installations is available at https://download.novell.com/patch/finder/  Select the Sentinel product to search for the patch.)
   (The upgrade for appliance installations is available in the appliance upgrade channel.)
-  7.1.0.2 (Use this for standard installations only if your company's change window does not permit upgrading to 7.1.1.2.)
   (The upgrade for standard installations is available at https://download.novell.com/patch/finder/  Select the Sentinel product to search for the patch.)

In these patched versions, Sentinel uses an alternate approach to copying parsed event data from primary storage to secondary storage.  Testing has verified that this approach works as expected.
To further ensure no data is lost, additional checks have been put in place that are performed after the copy task is complete.  These checks occur for both parsed event partitions and raw data and are run automatically after every partition is copied to secondary storage.  If a check fails, the system will automatically retry the copy operation a few minutes later to ensure that the partition is completely copied.
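
The copy, verify, and retry pattern described above can be illustrated with a short sketch. The following Python example is not Sentinel's implementation and says nothing about which checks Sentinel actually performs. It only shows the general idea of copying a file, comparing its size and checksum against the source, and retrying after a delay if they differ. The function names, the use of SHA-256, the three-attempt limit, and the 300-second delay are all assumptions made for illustration.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large partitions need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(source: Path, target: Path,
                           max_attempts: int = 3,
                           retry_delay_seconds: float = 300.0) -> bool:
    """Copy source to target, then verify the copy; retry on mismatch.

    Hypothetical helper: the attempt limit and delay are illustrative
    values, not settings taken from Sentinel.
    """
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(source, target)
        # A truncated copy shows up as a size mismatch; the checksum
        # additionally catches content differences at equal size.
        same_size = source.stat().st_size == target.stat().st_size
        if same_size and sha256_of(source) == sha256_of(target):
            return True
        if attempt < max_attempts:
            time.sleep(retry_delay_seconds)
    return False
```

The size comparison is the cheap check that would catch a silent truncation of the kind described in this document; the checksum is an extra safeguard in this sketch.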

If you believe that your system is at risk of being affected by this issue, download and install a patched version of Sentinel as soon as possible. The patch will prevent the issue from occurring for any new event partitions that are copied to secondary storage.
Once the patch is installed, run the Index Log Check Tool (see below) to scan your system for any partitions that might have been truncated.  If you find any such partitions, contact Support for information on how to recover the secondary event partitions.

Cause

Sentinel 7.1 introduced the use of a new Java API method for copying parsed event data from primary storage to secondary storage that was intended to provide better performance. Unfortunately, this new method exhibits some undocumented behavior that can cause event partitions of size greater than 2GB (compressed) to be truncated with no reported errors. The truncation affects only the secondary storage event partitions, which are copies of primary event partitions.

This issue is likely to occur on Sentinel systems where more than about 500 events per second are stored in an individual event partition, and that partition is copied to secondary storage. Data retention rules that separate the event data into smaller partitions will help keep individual partitions below the 2GB limit. Raw data storage is unaffected by this defect, as are primary storage and all real-time functions.

If you answer yes to all of the following questions, you are potentially affected by this defect:
    Are you using one of these versions of Sentinel: 7.1.0.0, 7.1.0.1, 7.1.1.0, 7.1.1.1?
        In the Sentinel Web UI, visit About in the upper-right tab set.
    Is secondary storage enabled on your system?
        In the Sentinel Web UI, visit Storage > Configuration and read the section under Data Storage Location.
        In earlier versions of Sentinel, secondary storage was named "network" storage.
    Is the amount of parsed event data that your system stored in any single partition greater than approximately 2 gigabytes?
        Under Storage > Configuration > Data Retention, calculate the largest average per-day size by dividing the Size column by the At most column, and then dividing by 10 (for compression).
        If the result is above 1GB, run the Index Log Check Tool on your system. The threshold is deliberately set low to account for day-to-day variations in volume.
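
The per-day size estimate in the last question can be worked through with a small calculation. The Python sketch below assumes that the Size column is expressed in GB and the At most column in days; both units, the function names, and the sample figures are assumptions for illustration, so adjust them to match what your Data Retention table shows.

```python
from typing import Iterable, Tuple

# Values taken from the checklist above: divide by 10 for compression,
# and use 1GB as the (deliberately low) threshold.
COMPRESSION_FACTOR = 10.0
THRESHOLD_GB = 1.0


def estimated_daily_partition_gb(size_gb: float, at_most_days: float) -> float:
    """Estimate the compressed size of one day's partition, in GB."""
    if at_most_days <= 0:
        raise ValueError("at_most_days must be positive")
    return size_gb / at_most_days / COMPRESSION_FACTOR


def should_run_index_log_check(policies: Iterable[Tuple[float, float]]) -> bool:
    """Return True if the largest per-day estimate exceeds the threshold.

    Each entry is a hypothetical (Size in GB, At most in days) pair, one
    per row of the Data Retention table.
    """
    largest = max(
        (estimated_daily_partition_gb(size, days) for size, days in policies),
        default=0.0,
    )
    return largest > THRESHOLD_GB
```

For example, a retention policy showing a Size of 450 GB and an At most value of 30 days gives 450 / 30 / 10 = 1.5 GB per day, which is above the 1GB threshold, so the Index Log Check Tool should be run.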

The Index Log Check Tool will check your event partitions and inform you which, if any, partitions have experienced data truncation by comparing the event partition index (which is not affected by this issue) with the stored event data. The tool is built into Sentinel as of the fixed versions 7.1.0.2 and 7.1.1.2. See below for instructions on using the tool.

Status

Top Issue

Additional Information

Instructions for Using the Index Log Check Tool
1.  Install the 7.1.x.2 hotfix or later on the Sentinel server by following the instructions in the release notes.
2.  Connect to a console on the Sentinel server and log in as the 'novell' user (or connect as 'root', and then 'su - novell' to become the novell user).
3.  Run the following commands as the 'novell' user (adjust the paths to the Sentinel binaries and secondary storage as needed):

# cd /opt/novell/sentinel/lib
# /opt/novell/sentinel/jre/bin/java -cp ccsbase.jar esecurity.ccs.comp.event.indexedlog.IndexedLogCheck /path/to/secondary/storage/<UUID>/eventdata_archive/* 2> /tmp/checkresults.log

NOTE: For Sentinel 7.2 or later, use ccsapp*.jar instead of ccsbase.jar.  For Sentinel 7.2, this would be "ccsapp-7.2.0.0-RELEASE.jar".
EXAMPLE: /opt/novell/sentinel/jre/bin/java -cp ccsapp-7.2.0.0-RELEASE.jar esecurity.ccs.comp.event.indexedlog.IndexedLogCheck /path/to/secondary/storage/<UUID>/eventdata_archive/* 2> /tmp/checkresults.log

4.  If a partition experienced data truncation, you will see SEVERE messages like the following in the /tmp/checkresults.log file (grep or search for lines marked SEVERE):

Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Checking /tmp/tmparchive/88C9AAD0-69BF-1031-B536-000C2908FFD7/eventdata_archive/20140203_6E1CCA35-4BD4-102D-91CD-000C2907C76D
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Mounting SquashFS index directory
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Opening index
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Opening compressed log
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Index contains 3,485 documents
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Index contains 0 deleted documents
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Compressed log contains 524,288 bytes
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Compression ratio is 8.925:1
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Checking all documents in the index...
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
SEVERE: Document 1,060 points to out-of-range offset 524,769
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
SEVERE: Document 1,061 points to out-of-range offset 525,806
Feb 04, 2014 3:58:45 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
SEVERE: Document 1,062 points to out-of-range offset 526,330
Feb 04, 2014 3:58:46 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Checked 3,485 documents
Feb 04, 2014 3:58:46 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Closing index:
Feb 04, 2014 3:58:46 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: Closing compressed log:
Feb 04, 2014 3:58:46 PM esecurity.ccs.comp.event.indexedlog.IndexedLogCheck check
INFO: /tmp/tmparchive/88C9AAD0-69BF-1031-B536-000C2908FFD7/eventdata_archive/20140203_6E1CCA35-4BD4-102D-91CD-000C2907C76D
Corruptions: 3
Warnings: 0
Check Errors: 0
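
When many partitions are checked in one run, it can help to summarize the SEVERE lines per partition instead of reading the whole log. The following Python sketch assumes the log layout shown in the sample above, where an "INFO: Checking <path>" line names the partition and any SEVERE lines for that partition follow it. The function names are hypothetical, and the script is not part of Sentinel.

```python
import re
from typing import Dict, Iterable

# Matches the line that names the partition being checked. Requiring a
# leading "/" skips "INFO: Checking all documents in the index...".
PARTITION_LINE = re.compile(r"^INFO: Checking (/.*)$")


def severe_counts_by_partition(lines: Iterable[str]) -> Dict[str, int]:
    """Count SEVERE lines under the most recently named partition."""
    counts: Dict[str, int] = {}
    current = "<unknown partition>"
    for raw in lines:
        line = raw.strip()
        match = PARTITION_LINE.match(line)
        if match:
            current = match.group(1)
            counts.setdefault(current, 0)
        elif line.startswith("SEVERE:"):
            counts[current] = counts.get(current, 0) + 1
    return counts


def truncated_partitions(log_path: str) -> Dict[str, int]:
    """Return only the partitions that have at least one SEVERE line."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        counts = severe_counts_by_partition(handle)
    return {path: n for path, n in counts.items() if n > 0}
```

Calling truncated_partitions("/tmp/checkresults.log") returns a mapping from partition path to the number of SEVERE lines found for it. Any partition listed there is a candidate to report to Support.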