Collector Development Guide

DATA PARSING

So far we've gone through a quick process to bootstrap development and get a prototype of our new Collector up and running in a live Sentinel environment. We haven't written any actual code yet, or modified any of the other resources in the Collector source directory. Let's pause and make sure we put this all in context.

The recommended development process looks like:

  1. Figure out what product you are trying to collect data from. Do some initial research and gather the product name, vendor name, and if possible the product documentation, technical information such as how it generates and stores audit event records, and that sort of thing.
  2. Using the Eclipse interface, create a new Collector plug-in based on the template and name it based on the product you are collecting from (see Getting Started above).
  3. Create an initial build of the unmodified new Collector and import it into ESM.
  4. Configure the event source and the Collector/Connector/Event Source chain to get a set of test records from the source (see Initial Build). Optionally, you can use the Generic Event Collector for this purpose.
  5. Modify the Collector resources to properly parse input records from the source.
  6. Construct an output Event object based on the parsed data (see Event Construction).
  7. As you develop your parsing logic, repeatedly create development builds to test and debug your code (see Build Process).
  8. Edit the Collector meta-data, select/create parameters, finalize documentation, and select and/or create Pack controls (various sections) to enhance the functionality of your Collector.
  9. Perform a final build and release your plug-in.

We've already covered the first few steps here, and now we'll delve into the development of parsing logic.

The Scripting Environment

The purpose of the Collector is to parse input records into native Sentinel-domain objects. For this purpose we use the JavaScript language with some custom extensions:

  • The JavaScript interpreter is based on Rhino.
  • You can use all of the standard JS objects including E4X.
  • Many Sentinel-specific objects are defined, like Identities and Accounts, that you can access and use in your code.
  • Some sample code is provided to, for example, get data from a Connector.
  • [Advanced] You can call Java methods by importing the appropriate JDK classes.
  • [Advanced] You can also define your own Java objects and methods in your own JAR files, and include them in the plug-ins.

In other words, you have an extremely powerful and flexible system at your disposal that can do virtually any sort of computing task that you can imagine. On the other hand, you will be best served by limiting what the Collector does to its defined task — parsing data — rather than using it as a general-purpose Sentinel extension mechanism. Since Collectors run in the Sentinel process space, using them to do things other than parsing risks system instability and performance degradation.

The Input Record

If you've been paying attention, you'll recall that the Collector template implements a tight loop that receives a new record from the Connector, parses it, converts it into a Sentinel Event, sends that event, resets, then receives another record and repeats. So what does this input record from the Connector look like?

First, note that the input record is usually passed from the associated Connector to the Collector (we say "usually" because it is possible, although highly unusual, to do this in different ways). In this SDK we represent the input record as the global object, 'rec', which is of class Record; the code simply makes a call like rec = conn.read() and what comes back is an input record.

A Record usually contains the following information:

  • Data from the event source:
    • If the source is line-oriented like a file, syslog, etc, then the raw data will typically be in the rec.s_RXBufferString variable.
    • If the source produces structured data, like a database query that returns columns, the raw data will typically be in an object rec.RXMap that has a set of attributes, one for each column or input field.
  • Metadata about the event source:
    • This includes things such as the IP address of the host that sent us the event data (for the Syslog Connector and others), the file name (for the File Connector), and anything else that the Connector can determine about the event source and how the record got to the Connector. This data is often used to supplement the output Event with additional information beyond that supplied solely within the source's event record.
  • Metadata about the Connector:
    • This includes the Connector name, what version it is, how it is configured, and so forth. This data is often used to select different parsing paths depending on how the data was gathered and how it was processed by the Connector.

The metadata fields are usually stored as attributes on the rec object with an s_ or i_ prefix, for example s_Version, s_FileName, and so forth. There are a couple major exceptions such as rec.CONNECTION_METHOD. The Connector documentation should fully enumerate the list of fields that it will typically produce for metadata, and that list can be compared to the captured sample data to give you a sense for what data those fields can contain.
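Parsing logic often branches on this Connector metadata. A minimal sketch (the rec object here is a mocked plain object, not the real SDK Record):

```javascript
// Mocked record metadata, shaped like what a Connector might supply
var rec = { CONNECTION_METHOD: "SYSLOG", s_Version: "6r9" };

// Select a parsing path based on how the data arrived
var parsePath;
if (rec.CONNECTION_METHOD === "SYSLOG") {
    parsePath = "syslog";
} else if (rec.CONNECTION_METHOD === "FILE") {
    parsePath = "file";
} else {
    parsePath = "generic";
}
```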

Here's some sample input from the Syslog Connector:

{
    "CONNECTION_METHOD":"SYSLOG",
    "CONNECTION_MODE":"map",
    "s_Version":"6r9",
    "s_SyslogRelayIp":"192.168.225.1",
    "s_MessageOriginatorPort":"514",
    "s_RXBufferString":"Oct 14 14:19:06 192.168.225.1 %PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101, IP options 0x80487f3e",
    "i_RXBufferLength":"190",
    "i_syslog_priority":"190",
    "i_syslog_facility":"23",
    "i_syslog_severity":"6",
    "s_Date":"Oct 14 14:19:06",
    "i_Month":"9",
    "i_DayOfMonth":"14",
    "i_Hour":"14",
    "i_Minute":"19",
    "i_Second":"6",
    "i_milliseconds":"1287091146000",
    "s_MessageOriginatorHost":"192.168.225.1",
    "s_Body":"%PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101, IP options 0x80487f3e",
    "s_AppId":"PIX",
    "s_Process":null,
    "s_Pid":null,
    "s_RV24":"FD0A53E0-A7F8-102D-B077-001A6450619A",
    "s_RV25":"59748C80-BA06-102D-AE4C-001A6450619A",
    "i_TrustDeviceTime":""
}

In this case, the first few fields tell us what Connector was in use, what mode it was configured to use (this is set by the Collector), and what version of the Connector it was. Next, we have some information about the host that sent us the data (the relay or Reporter, not necessarily the actual host that generated the event record). Then we have the full original event record, followed by a set of pre-parsed data pulled from the syslog header (the Syslog Connector can do this parsing because the format is fixed). Finally we have the two s_RVNN fields, which give us the UUIDs of the Connector and Event Source ESM nodes, respectively, and then a variable that tells us whether Sentinel is configured to trust the event source's clock.

As you can see, the Connector does quite a bit of work to pre-parse some of the fields out of the input record, especially if the input is known to follow a particular standard. This leaves the Collector developer with the job of parsing the rest of the record — for example, pulling the relevant IP addresses out of the rec.s_Body variable and storing them in SourceIP and TargetIP — and constructing a nice descriptive output Event.
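As a sketch of that remaining job, here is how the two IP addresses might be pulled out of the sample s_Body shown above (the variable names sourceIP and targetIP are illustrative, not SDK names):

```javascript
// The s_Body value from the sample Syslog Connector record above
var body = "%PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101, IP options 0x80487f3e";

// Pull the two IP addresses out of the free-form message text
var m = /Deny IP from (\d+\.\d+\.\d+\.\d+) to (\d+\.\d+\.\d+\.\d+)/.exec(body);
var sourceIP = m ? m[1] : null;
var targetIP = m ? m[2] : null;
```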

Guidelines for Writing Parsing Logic

There are some key points to remember when you sit down to write parsing logic:

  • The Collector template handles the control flow of fetching the event record from the Connector, and then calling the parsing methods for that record.
  • It does this by calling defined member methods of the Record class, like this: rec.parse(e). You will see that the parsing methods you are to write are defined as prototypes on the Record class, which means they are accessible as member methods.
  • When the template calls a method like rec.parse(e), the script execution drops into the scope of the rec object. This means that the rec object is now referred to as this, and all those attributes you looked at above, like rec.CONNECTION_METHOD, will now be available as this.CONNECTION_METHOD.
  • Part of the reason it is relatively important to define methods like parse() as prototypes is that in some cases you may wish to call parse() on Record objects that are not the rec global — this is used when a single input record does not correspond to a single output Event, for example.
  • The template will call a series of methods on the rec object, as follows:
    • rec.preParse(): The purpose of this method is to clean up the input record and reformat it if necessary into a standard form. For example, in some cases some unnecessary headers may be removed, or perhaps the Collector can handle input from different Connectors but works best if the format is normalized before parsing.
    • rec.parse(): This is the core parsing code, where the input record is chopped up into lots of little pieces, each of which represents a distinct semantic component.
    • rec.normalize(): In this method, any of the pieces of the record that don't quite match what Sentinel expects are converted, data is normalized, additional data is mapped in, etc.
    • rec.postParse(): This method is rarely used, but can be set up to do things like send acknowledgements back to the source to indicate that parsing was successful.
    (the template also calls rec.customPreparse() and rec.customParse(), but those should never be modified by the developer of a Collector — they should be left for very localized implementation overrides).
  • Although the parsing states are broken up into phases as described above, there's no need to be strict about this — the phases are provided for convenience not to force a particular style. Some developers simply do all their work in parse() and leave it at that.
  • The template provides some sample code and some handlers for particular types of Connectors in the release.js file, which is what you as the developer will be editing. You can choose to use that code as is, delete it, or edit it to suit your needs.
  • The purpose of the parsing and normalization is not to construct an output Event — that will happen later. The purpose is to break up the input into distinct bits and pieces, and to normalize those pieces, so that later an Event can be constructed with all the right data. We'll discuss this more in a bit.
  • Because the parsing methods are called as member methods of the rec object, local variables defined in your parsing logic will disappear when that particular method exits. As a result, if you actually want to keep your partial parsing results around, you'll need to store them in a global variable where they can be found again. The typical way of doing this, and the method the Collector assumes, is simply to create additional attributes on the rec object and keep adding to that set as you parse out additional fields. Thus the rec object becomes a container that holds the original raw data, the metadata provided by the Connector, and also all your parsed data.
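The points above can be sketched as follows. The Record constructor here is a stand-in for the SDK's class, used only to show how a prototype method sees the record as this and stores parsed fields back onto it:

```javascript
// Stand-in for the SDK's Record class, for illustration only
function Record() {
    this.s_RXBufferString = "Deny IP from 10.10.15.103 to 24.58.92.101";
}

// Parsing methods are defined as prototypes, so inside them
// the current record is available as 'this'
Record.prototype.parse = function() {
    var m = /from (\S+) to (\S+)/.exec(this.s_RXBufferString);
    if (!m) { return false; }
    // Store partial results as new attributes on the record itself,
    // not in local variables, so later phases can find them
    this.sip = m[1];
    this.dip = m[2];
    return true;
};

var rec = new Record();
rec.parse();
// rec.sip and rec.dip now persist alongside the raw data
```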

Types of Input Records

In general, input records can be categorized in a couple different ways. Some sources, like databases, produce structured records with a set of defined fields (columns) that will not vary from record to record. Other sources (such as Cisco firewalls) produce unstructured records where any particular input record looks very little like other input records. And of course there are hybrids, like Juniper firewalls, where some records follow a nice clean structured name-value pair form, but other records do not.

The way the template is structured makes it very easy to handle structured records - in fact in many cases you don't have to write any parsing logic at all to handle such records. Note that it doesn't actually matter if the input is pre-processed into a structured object by the Connector (like the Database Connector) or comes through as a long structured string — the preParse() method will typically be used to pre-process structured data into a native JavaScript object.
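As a rough illustration of that pre-processing step, preParse() might turn a delimited record string into a native object like this (the pipe-delimited format and column names are hypothetical):

```javascript
// Hypothetical pipe-delimited input with a fixed column order
var raw = "10.10.15.103|24.58.92.101|jdoe";
var columns = ["IP1", "IP2", "USER"];

// Split the string and build a structured map, as preParse() might do
var fields = raw.split("|");
var RXMap = {};
for (var i = 0; i < columns.length; i++) {
    RXMap[columns[i]] = fields[i];
}
```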

To explain why this is the case, we need to look at the Rec2Evt.map file. That file defines a mapping between a Record object and an Event object. In essence, it lists each field of the Event, and then describes where in the Record the source data for that field can be found.

Now, for structured input records this is super-convenient: let's say we have a database and our query picks up three columns, IP1, IP2, and USER. These show up in the rec object as rec.RXMap.IP1 and so forth. And let's say we know that IP1 corresponds to what we think of on the Sentinel side as SourceIP, IP2 as TargetIP, and USER as InitiatorUserName. So then all we have to do is set up our Rec2Evt.map as follows:

SourceIP,RXMap.IP1
TargetIP,RXMap.IP2
InitiatorUserName,RXMap.USER
       

(Note that the thing after the comma describes the member attribute of the Record from which the data can be pulled. Do not include the name of that object, i.e. rec.) That's it! With no parsing logic, we've mapped three fields to the output, and they will automatically be set when the Event is sent.

So where does this leave unstructured records, or hybrid systems? Well, the approach you take will depend to some degree on the amount of unstructured-ness. If most records follow a pattern, or there are nice groupings, then you can set up several different Record-to-Event conversion map files and switch between them. In fact, if you had N different input patterns and you wanted to define N different mappings, you could do that. But in practice, what typically happens is that developers end up parsing the raw event record into a pseudo-schema in the rec object, which really means they pick some pre-defined field names and "map" each unique record to that pseudo-schema as they go. One could use the long Sentinel field names, in which case the Rec2Evt.map file would end up with a lot of lines like SourceIP,SourceIP, but generally people use the Sentinel internal tag names so they don't have to type so much, or something close to that.

To give an example, let's say your input record has a string in rec.s_RXBufferString like this Deny IP from 10.10.15.103 to 24.58.92.101. In this case your parsing might look like:

var m = /(.*) from (.*) to (.*)/.exec(this.s_RXBufferString);
if (m) {
    this.evt = m[1];
    this.sip = m[2];
    this.dip = m[3];
}
       

Then all you'd have to do is edit Rec2Evt.map as follows:

EventName,evt
SourceIP,sip
TargetIP,dip
       

Again, the names you give to the rec attributes are not important, as long as you are consistent across all the events that will use that specific Rec2Evt.map.

Parsing and Normalization Tools

Most if not all event sources will send Sentinel strings of one sort or another, even the ones that send structured data. Luckily, JavaScript provides a wide variety of parsing methods that make the chore of chopping up input records a much easier job. Reference sites like W3Schools can help you learn the available methods. Methods like substr(), indexOf(), and the various regex operators are used quite commonly.

In addition to the standard JavaScript String methods, the SDK provides some custom convenience methods for String objects that are designed to assist in parsing typical data structures that we see all the time coming from IT systems. These are documented in our API for the String object, and include custom parsers for name-value pairs, quoted CSV strings, and so forth.
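The actual method names and signatures are in the SDK's String API documentation. As a rough illustration of what name-value-pair parsing involves, a hand-rolled equivalent (not the SDK's method) might look like:

```javascript
// Hand-rolled name-value-pair parser, for illustration only;
// the SDK ships its own String helpers for this
function parseNVP(s, pairSep, kvSep) {
    var result = {};
    var pairs = s.split(pairSep);
    for (var i = 0; i < pairs.length; i++) {
        var idx = pairs[i].indexOf(kvSep);
        if (idx > 0) {
            result[pairs[i].substring(0, idx)] = pairs[i].substring(idx + 1);
        }
    }
    return result;
}

var parsed = parseNVP("src=10.10.15.103 dst=24.58.92.101 user=jdoe", " ", "=");
```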

For date handling, we include in the template the date.js library, which provides a wide variety of convenience methods for parsing and comparing dates and times. These methods are often used to construct the Date object used to set the date and time of the output Event.
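The date.js library provides richer helpers than this, but with plain JavaScript, assembling a Date from the pre-parsed syslog fields shown in the sample record earlier might look like the following (syslog timestamps carry no year, so we borrow the current one; that assumption is a simplification):

```javascript
// Pre-parsed syslog fields, mocked from the sample record above
var rec = { i_Month: "9", i_DayOfMonth: "14", i_Hour: "14", i_Minute: "19", i_Second: "6" };

// Syslog dates have no year, so assume the current one
var now = new Date();
var eventTime = new Date(
    now.getFullYear(),
    parseInt(rec.i_Month, 10),      // JS months are 0-based; 9 = October
    parseInt(rec.i_DayOfMonth, 10),
    parseInt(rec.i_Hour, 10),
    parseInt(rec.i_Minute, 10),
    parseInt(rec.i_Second, 10)
);
```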

When you get to normalizing the data in the input record it is quite common to use maps to map classes of input to output data. For this purpose we provide our KeyMap object and associated methods.
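The KeyMap object and its methods are documented in the SDK API. As a simplified stand-in, the same idea can be shown with a plain JavaScript lookup table (the severity scale and default below are illustrative assumptions, not SDK definitions):

```javascript
// Plain-object lookup table standing in for the SDK's KeyMap:
// map product-specific severity names onto a numeric scale
var severityMap = {
    "emergency": 5, "critical": 5, "error": 4,
    "warning": 3, "notice": 2, "info": 1, "debug": 0
};

function normalizeSeverity(s) {
    var key = String(s).toLowerCase();
    // Fall back to an assumed informational level when unmapped
    return severityMap.hasOwnProperty(key) ? severityMap[key] : 1;
}
```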

Finally, in some cases you might find that a single input record is not enough, that you need to combine data from several records into a single output record. We have provided Sessions to handle this scenario, but as these are a more complex topic they will be covered elsewhere.

Requirements for Methods

Here are some things you really should do in your Collector methods:

  • rec.preParse():
    • Check the input record rec and make sure you actually got real data.
    • Do any record cleanup, format normalization, and pre-processing of regular data structures.
    • Return true if everything looks good so far, false if not to short-circuit parsing and skip to the next record.
  • rec.parse():
    • Attempt to recognize and parse data values from the event record, and store them back in this.
    • If the record is not recognized, optionally call this.sendUnsupported() to send the raw data to Sentinel, and return false.
    • Set the flag instance.SEND_EVENT if you determine that the current record should indeed be sent as an event.
    • Return true if everything looks good so far, false if not to short-circuit parsing and skip to the next record.
  • rec.normalize():
    • Construct a JavaScript Date object from the event data, and call e.setObserverEventTime() with that date. Optionally do so for BeginTime and EndTime as well.
    • Set the taxonomy key for this record — see the taxonomy chapter.
    • Return true if everything looks good so far, false if not to short-circuit parsing and skip to the next record.

As mentioned, you have plenty of flexibility as to exactly where you do each of the tasks listed above — in which method you set instance.SEND_EVENT, where you set the taxonomy key — but as long as they happen somewhere, your Collector should function correctly.
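Putting the requirements above together, a minimal skeleton might look like this. The Record constructor and the instance flag are mocked here; in a real Collector the template supplies them, and normalize() would also set the event time and taxonomy key:

```javascript
// Mocks for illustration; the template supplies the real ones
function Record(raw) { this.s_RXBufferString = raw; }
var instance = { SEND_EVENT: false };

Record.prototype.preParse = function() {
    // Make sure we actually received real data before parsing
    return typeof this.s_RXBufferString === "string" && this.s_RXBufferString.length > 0;
};

Record.prototype.parse = function() {
    var m = /from (\S+) to (\S+)/.exec(this.s_RXBufferString);
    if (!m) { return false; }     // unrecognized record: short-circuit
    this.sip = m[1];
    this.dip = m[2];
    instance.SEND_EVENT = true;   // this record should become an Event
    return true;
};

Record.prototype.normalize = function() {
    // Real code would set the event time and taxonomy key here
    return true;
};

var rec = new Record("Deny IP from 10.10.15.103 to 24.58.92.101");
var ok = rec.preParse() && rec.parse() && rec.normalize();
```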


© 2014 Novell