Collector Development Guide
DATA PARSING
So far we've gone through a quick process to bootstrap development and get a quick prototype of our new Collector up and running in a live Sentinel environment. We haven't written any actual code yet, nor modified any of the other resources in the Collector source directory. Let's pause and make sure we put this all in context.
The recommended development process looks like:
- Figure out what product you are trying to collect data from. Do some initial research and gather the product name, vendor name, and if possible the product documentation, technical information such as how it generates and stores audit event records, and that sort of thing.
- Using the Eclipse interface, create a new Collector plug-in based on the template and name it based on the product you are collecting from (see Getting Started above).
- Create an initial build of the unmodified new Collector and import it into ESM.
- Configure the event source and the Collector/Connector/Event Source chain to get a set of test records from the source (see Initial Build). Optionally, you can use the Generic Event Collector for this purpose.
- Modify the Collector resources to properly parse input records from the source.
- Construct an output Event object based on the parsed data (see Event Construction).
- As you develop your parsing logic, repeatedly create development builds to test and debug your code (see Build Process).
- Edit the Collector meta-data, select/create parameters, finalize documentation, and select and/or create Pack controls (various sections) to enhance the functionality of your Collector.
- Perform a final build and release your plug-in.
We've already covered the first few steps here, and now we'll delve into the development of parsing logic.
The Scripting Environment
The purpose of the Collector is to parse input records into native Sentinel-domain objects. For this purpose we use the JavaScript language with some custom extensions:
- The JavaScript interpreter is based on Rhino.
- You can use all of the standard JS objects including E4X.
- Many Sentinel-specific objects are defined, like Identities and Accounts, that you can access and use in your code.
- See the Sentinel JavaScript API for details.
- Some sample code is provided to, for example, get data from a Connector.
- [Advanced] You can call Java methods by importing the appropriate JDK classes.
- [Advanced] You can also define your own Java objects and methods in your own JAR files, and include them in the plug-ins.
In other words, you have an extremely powerful and flexible system at your disposal that can do virtually any sort of computing task that you can imagine. On the other hand, you will be best served by limiting what the Collector does to its defined task — parsing data — rather than using it as a general-purpose Sentinel extension mechanism. Since Collectors run in the Sentinel process space, using them to do things other than parsing risks system instability and performance degradation.
The Input Record
If you've been paying attention, you'll recall that the Collector template implements a tight loop that receives a new record from the Connector, parses it, converts it into a Sentinel Event, sends that event, resets, then receives another record and repeats. So what does this input record from the Connector look like?
First, note that the input record is usually passed from the associated Connector to the Collector (we say "usually" because it is possible, although highly unusual, to do this in different ways). In this SDK we represent the input record as the global object `rec`, which is of class Record; the code simply makes a call like `rec = conn.read()` and what comes back is an input record.
A Record usually contains the following information:
- Data from the event source:
  - If the source is line-oriented (a file, syslog, etc.), the raw data will typically be in the `rec.s_RXBufferString` variable.
  - If the source produces structured data, like a database query that returns columns, the raw data will typically be in an object `rec.RXMap` that has a set of attributes, one for each column or input field.
- Metadata about the event source:
  - This includes things such as the IP address of the host that sent us the event data (for the Syslog Connector and others), the file name (for the File Connector), and anything else the Connector can determine about the event source and how the record got to the Connector. This data is often used to supplement the output Event with additional information beyond that supplied solely within the source's event record.
- Metadata about the Connector:
  - This includes the Connector name, what version it is, how it is configured, and so forth. This data is often used to select different parsing paths depending on how the data was gathered and how it was processed by the Connector.
The metadata fields are usually stored as attributes on the `rec` object with an `s_` or `i_` prefix, for example `s_Version`, `s_FileName`, and so forth. There are a couple of major exceptions, such as `rec.CONNECTION_METHOD`. The Connector documentation should fully enumerate the list of metadata fields that it will typically produce, and that list can be compared to the captured sample data to give you a sense of what data those fields can contain.
Here's some sample input from the Syslog Connector:
```json
{
  "CONNECTION_METHOD": "SYSLOG",
  "CONNECTION_MODE": "map",
  "s_Version": "6r9",
  "s_SyslogRelayIp": "192.168.225.1",
  "s_MessageOriginatorPort": "514",
  "s_RXBufferString": "Oct 14 14:19:06 192.168.225.1 %PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101, IP options 0x80487f3e",
  "i_RXBufferLength": "190",
  "i_syslog_priority": "190",
  "i_syslog_facility": "23",
  "i_syslog_severity": "6",
  "s_Date": "Oct 14 14:19:06",
  "i_Month": "9",
  "i_DayOfMonth": "14",
  "i_Hour": "14",
  "i_Minute": "19",
  "i_Second": "6",
  "i_milliseconds": "1287091146000",
  "s_MessageOriginatorHost": "192.168.225.1",
  "s_Body": "%PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101, IP options 0x80487f3e",
  "s_AppId": "PIX",
  "s_Process": null,
  "s_Pid": null,
  "s_RV24": "FD0A53E0-A7F8-102D-B077-001A6450619A",
  "s_RV25": "59748C80-BA06-102D-AE4C-001A6450619A",
  "i_TrustDeviceTime": ""
}
```
In this case, the first few fields tell us what Connector was in use, what mode it was configured to use (this is set by the Collector), and what version of the Connector it was. Next, we have some information about the host that sent us the data (the relay or Reporter, not necessarily the actual host that generated the event record). Then we have the full original event record, followed by a set of pre-parsed data pulled from the syslog header (the Syslog Connector can do this parsing because the format is fixed). Finally we have the two s_RVNN fields, which give us the UUIDs of the Connector and Event Source ESM nodes, respectively, and then a variable that tells us whether Sentinel is configured to trust the event source's clock.
As you can see, the Connector does quite a bit of work to pre-parse some of the fields out of the input record, especially if the input is known to follow a particular standard.
This leaves the Collector developer with the job of parsing the rest of the record — for example, pulling the relevant IP addresses out of the `rec.s_Body` variable and storing them in SourceIP and TargetIP — and constructing a nice descriptive output Event.
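As a hedged sketch of that step, here is plain JavaScript that pulls the two IP addresses out of the `s_Body` value from the sample record above. The record object and regular expression are illustrative, not part of the SDK:

```javascript
// Illustrative sketch: extract the two IP addresses from the sample
// s_Body shown above and store them as attributes on the record
// object, which is where the Collector keeps parsed results.
var rec = {
  s_Body: "%PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101, IP options 0x80487f3e"
};

var m = /Deny IP from (\d+\.\d+\.\d+\.\d+) to (\d+\.\d+\.\d+\.\d+)/.exec(rec.s_Body);
if (m) {
  rec.sip = m[1];  // would later be mapped to SourceIP
  rec.dip = m[2];  // would later be mapped to TargetIP
}
```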
Guidelines for Writing Parsing Logic
There are some key points to remember when you sit down to write parsing logic:
- The Collector template handles the control flow of fetching the event record from the Connector, and then calling the parsing methods for that record.
- It does this by calling defined member methods of the Record class, like this: `rec.parse(e)`. You will see that the parsing methods you are to write are defined as prototypes on the Record class, which means they are accessible as member methods.
- When the template calls a method like `rec.parse(e)`, the script execution drops into the scope of the `rec` object. This means that the `rec` object is now referred to as `this`, and all those attributes you looked at above, like `rec.CONNECTION_METHOD`, will now be available as `this.CONNECTION_METHOD`.
- Part of the reason it is important to define methods like `parse()` as prototypes is that in some cases you may wish to call `parse()` on Record objects that are not the `rec` global — this is used when a single input record does not correspond to a single output Event, for example.
- The template will call a series of methods on the `rec` object, as follows:
  - `rec.preParse()`: The purpose of this method is to clean up the input record and reformat it, if necessary, into a standard form. For example, in some cases unnecessary headers may be removed, or perhaps the Collector can handle input from different Connectors but works best if the format is normalized before parsing.
  - `rec.parse()`: This is the core parsing code, where the input record is chopped up into lots of little pieces, each of which represents a distinct semantic component.
  - `rec.normalize()`: In this method, any pieces of the record that don't quite match what Sentinel expects are converted, data is normalized, additional data is mapped in, and so forth.
  - `rec.postParse()`: This method is rarely used, but can be set up to do things like send acknowledgements back to the source to indicate that parsing was successful.
  - (The template also calls `rec.customPreparse()` and `rec.customParse()`, but those should never be modified by the developer of a Collector — they should be left for very localized implementation overrides.)
- Although the parsing is broken up into phases as described above, there's no need to be strict about this — the phases are provided for convenience, not to force a particular style. Some developers simply do all their work in `parse()` and leave it at that.
- The template provides some sample code and some handlers for particular types of Connectors in the release.js file, which is the file you as the developer will be editing. You can choose to use that code as is, delete it, or edit it to suit your needs.
- The purpose of the parsing and normalization is not to construct an output Event — that will happen later. The purpose is to break up the input into distinct bits and pieces, and to normalize those pieces, so that later an Event can be constructed with all the right data. We'll discuss this more in a bit.
- Because the parsing methods are called as member methods of the `rec` object, any local variables defined in your parsing logic will disappear when that particular method exits. As a result, if you actually want to keep your partial parsing results around, you'll need to store them in a global variable where they can be found again. The typical way of doing this, and the method the Collector template assumes, is simply to create additional attributes on the `rec` object and keep adding to that set as you parse out additional fields. Thus the `rec` object becomes a container that holds the original raw data, the metadata provided by the Connector, and all your parsed data.
Types of Input Records
In general, input records can be categorized in a couple different ways. Some sources, like databases, produce structured records with a set of defined fields (columns) that will not vary from record to record. Other sources (such as Cisco firewalls) produce unstructured records where any particular input record looks very little like other input records. And of course there are hybrids, like Juniper firewalls, where some records follow a nice clean structured name-value pair form, but other records do not.
The way the template is structured makes it very easy to handle structured records - in fact in many cases you don't have to write any parsing logic at all to handle such records.
Note that it doesn't actually matter whether the input is pre-processed into a structured object by the Connector (as with the Database Connector) or comes through as a long structured string — the `preParse()` method will typically be used to pre-process structured data into a native JavaScript object.
To explain why this is the case, we need to look at the Rec2Evt.map file. That file defines a mapping between a Record object and an Event object. In essence, it lists each field of the Event, and then describes where in the Record the source data for that field can be found.
Now, for structured input records this is super-convenient: let's say we have a database and our query picks up three columns, IP1, IP2, and USER. These show up in the `rec` object as `rec.RXMap.IP1` and so forth. And let's say we know that IP1 corresponds to what we think of on the Sentinel side as SourceIP, IP2 as TargetIP, and USER as InitiatorUserName. So then all we have to do is set up our Rec2Evt.map as follows:

```
SourceIP,RXMap.IP1
TargetIP,RXMap.IP2
InitiatorUserName,RXMap.USER
```

(Note that the part after the comma names the member attribute of the Record from which the data is pulled; do not include the name of the `rec` object itself.) That's it! With no parsing logic, we've mapped three fields to the output, and they will automatically be set when the Event is sent.
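Conceptually, the mapping step behaves like the sketch below. This is illustrative only, not the SDK's actual implementation: each map line names an Event field and a dotted path into the Record, and the template copies the value it finds at that path into the output Event.

```javascript
// Conceptual illustration only -- not the SDK's real mapping engine.
// "RXMap.IP1" means rec.RXMap.IP1; walk the path, copy the value.
var rec = { RXMap: { IP1: "10.0.0.1", IP2: "10.0.0.2", USER: "alice" } };

var mapLines = [
  "SourceIP,RXMap.IP1",
  "TargetIP,RXMap.IP2",
  "InitiatorUserName,RXMap.USER"
];

var evt = {};
mapLines.forEach(function (line) {
  var parts = line.split(",");
  var value = parts[1].split(".").reduce(function (obj, key) {
    return obj ? obj[key] : undefined;  // walk the dotted path into rec
  }, rec);
  evt[parts[0]] = value;
});
```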
So where does this leave unstructured records, or hybrid systems? Well, the approach you take will depend to some degree on the amount of unstructured-ness. If most records follow a pattern, or there are nice groupings, then you can set up several different Record-to-Event conversion map files and switch between them. In fact, if you had N different input patterns and you wanted to define N different mappings, you could do that. But in practice, what typically happens is that developers end up parsing the raw event record into a pseudo-schema in the `rec` object, which really means they pick some pre-defined field names and "map" each unique record to that pseudo-schema as they go. One could use the long Sentinel field names, in which case the Rec2Evt.map file would end up with a lot of lines like `SourceIP,SourceIP`, but generally people use the Sentinel internal tag names (or something close to them) so they don't have to type so much.
To give an example, let's say your input record has a string in `rec.s_RXBufferString` like this: `Deny IP from 10.10.15.103 to 24.58.92.101`. In this case your parsing might look like:

```javascript
/(.*) from (.*) to (.*)/.exec(this.s_RXBufferString);
this.evt = RegExp.$1;
this.sip = RegExp.$2;
this.dip = RegExp.$3;
```

Then all you'd have to do is edit Rec2Evt.map as follows:

```
EventName,evt
SourceIP,sip
TargetIP,dip
```

Again, the names you give to the `rec` attributes are not important, as long as you are consistent across all the events that will use that specific Rec2Evt.map.
Parsing and Normalization Tools
Most if not all event sources will send Sentinel strings of one sort or another, even the ones that send structured data. Luckily, JavaScript provides a wide variety of parsing methods that make the chore of chopping up input records a much easier job. Reference sites like W3 Schools can help you learn the available methods. Methods like `substr()`, `indexOf()`, and the various regex operators are used quite commonly.
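As a small sketch of those plain String methods (the input string is taken from the sample record above; the variable names are illustrative):

```javascript
// Use indexOf/substring to pull the message ID out of a PIX-style
// header like "%PIX-2-106012: ...".
var body = "%PIX-2-106012: Deny IP from 10.10.15.103 to 24.58.92.101";

var colon = body.indexOf(":");
var header = body.substring(1, colon);                      // "PIX-2-106012"
var msgId = header.substring(header.lastIndexOf("-") + 1);  // "106012"
```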
In addition to the standard JavaScript String methods, the SDK provides some custom convenience methods for String objects that are designed to assist in parsing typical data structures that we see all the time coming from IT systems. These are documented in our API for the String object, and include custom parsers for name-value pairs, quoted CSV strings, and so forth.
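The SDK's own String helpers are documented in its API; as a plain-JavaScript illustration of the idea, a simple name-value-pair record can be parsed like this (the `parseNVP` function and its input are made up for this sketch):

```javascript
// Hypothetical NVP parser: split "user=bob action=login result=ok"
// into an object of fields.  The SDK's documented helpers handle
// trickier cases (quoting, embedded separators) that this does not.
function parseNVP(input) {
  var result = {};
  input.split(" ").forEach(function (pair) {
    var eq = pair.indexOf("=");
    if (eq > 0) {
      result[pair.substring(0, eq)] = pair.substring(eq + 1);
    }
  });
  return result;
}

var fields = parseNVP("user=bob action=login result=ok");
```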
For date handling, we include in the template the date.js library, which provides a wide variety of convenience methods for parsing and comparing dates and times. These methods are often used to construct the Date object used to set the date and time of the output Event.
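For simple cases, a standard JavaScript Date can be built from the pre-parsed syslog fields shown in the sample record earlier (this sketch assumes the current year, since classic syslog timestamps omit it; note that `i_Month` in the sample is zero-based, so 9 means October):

```javascript
// Build a Date from the Connector's pre-parsed syslog time fields.
// date.js offers richer helpers; plain Date works for simple cases.
var rec = {
  i_Month: "9", i_DayOfMonth: "14",
  i_Hour: "14", i_Minute: "19", i_Second: "6"
};

var when = new Date();  // assume the current year
when.setMonth(parseInt(rec.i_Month, 10), parseInt(rec.i_DayOfMonth, 10));
when.setHours(parseInt(rec.i_Hour, 10), parseInt(rec.i_Minute, 10),
              parseInt(rec.i_Second, 10), 0);
```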
When you get to normalizing the data in the input record it is quite common to use maps to map classes of input to output data. For this purpose we provide our KeyMap object and associated methods.
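The KeyMap object is the SDK's documented mechanism for this; the underlying idea is just a lookup from raw input values to normalized output values, which can be sketched with a plain object (the severity names below are illustrative, not a Sentinel standard):

```javascript
// Illustrative normalization lookup: map raw syslog severity codes
// to human-readable names.  A real Collector would use KeyMap.
var severityMap = {
  "0": "EMERGENCY",
  "2": "CRITICAL",
  "6": "INFO"
};

var rec = { i_syslog_severity: "6" };
rec.severityName = severityMap[rec.i_syslog_severity] || "UNKNOWN";
```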
Finally, in some cases you might find that a single input record is not enough, that you need to combine data from several records into a single output record. We have provided Sessions to handle this scenario, but as these are a more complex topic they will be covered elsewhere.
Requirements for Methods
Here are some things you really should do in your Collector methods:
- `rec.preParse()`:
  - Check the input record `rec` and make sure you actually got real data.
  - Do any record cleanup, format normalization, and pre-processing of regular data structures.
  - Return `true` if everything looks good so far, or `false` to short-circuit parsing and skip to the next record.
- `rec.parse()`:
  - Attempt to recognize and parse data values from the event record, and store them back in `this`.
  - If the record is not recognized, optionally call `this.sendUnsupported()` to send the raw data to Sentinel, and return `false`.
  - Set the flag `instance.SEND_EVENT` if you determine that the current record should indeed be sent as an event.
  - Return `true` if everything looks good so far, or `false` to short-circuit parsing and skip to the next record.
- `rec.normalize()`:
  - Construct a JavaScript Date object from the event data, and call `e.setObserverEventTime()` with that date. Optionally do the same for BeginTime and EndTime.
  - Set the taxonomy key for this record — see the taxonomy chapter.
  - Return `true` if everything looks good so far, or `false` to short-circuit parsing and skip to the next record.

As mentioned, you have plenty of flexibility as to exactly where you do each of the tasks listed above — in which method you set `instance.SEND_EVENT`, where you set the taxonomy key — but as long as they happen somewhere, your Collector should function correctly.
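The requirements above can be sketched as a skeleton. This is illustrative only: `instance` and `e` below are stand-ins for the globals the template actually provides, and the regex is a made-up example.

```javascript
// Stand-ins for template-provided globals (illustrative only).
var instance = { SEND_EVENT: false };
var e = {
  observerTime: null,
  setObserverEventTime: function (d) { this.observerTime = d; }
};

function Record(raw) { this.s_RXBufferString = raw; }

Record.prototype.preParse = function () {
  // Make sure we actually got real data before doing any work.
  return typeof this.s_RXBufferString === "string" &&
         this.s_RXBufferString.length > 0;
};

Record.prototype.parse = function () {
  var m = /(.+) from (\S+) to (\S+)/.exec(this.s_RXBufferString);
  if (!m) { return false; }     // unrecognized: skip to the next record
  this.evt = m[1];
  this.sip = m[2];
  this.dip = m[3];
  instance.SEND_EVENT = true;   // this record should become an Event
  return true;
};

Record.prototype.normalize = function () {
  // A real Collector would parse the record's own timestamp here.
  e.setObserverEventTime(new Date());
  return true;
};

var rec = new Record("Deny IP from 10.10.15.103 to 24.58.92.101");
var ok = rec.preParse() && rec.parse() && rec.normalize();
```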