Data Cleaning

Analyzer now has the capability to not only test for bad data but to clean it as well. Data Cleaning is based on scripting capability of Script Metrics.

The same script is used for either testing or cleaning data. A script used for testing returns the boolean variable "aimReturnStatus" as either True or False depending on whether the data passed the test. The same script returns a string variable "aimReturnValue" containing the cleaned data. When applied as a testing script, Analyzer looks at "aimReturnStatus". When applied as a cleaning script, Analyzer looks at "aimReturnValue" and, if different, modifies the data value. "aimReturnValue" is returned with the original value ("aimValue") if no data modification is to take place. The modified value appears in the Data Browser highlighted in yellow with a pencil icon just as if it had been modified individually.

Data Cleaning Variables

The input variables are the same as for Script Metrics, with the addition of "aimReturnValue".
   aimValue             - the current value being tested
   aimArrayValue        - array of all values in the current cell
   aim_<attrName>       - the value of each attribute in the dataset
   aimArray_<attrName>  - array of values for each attribute

   aimReturnStatus      - boolean variable.  Set to true if value passed test, false if failed.
   aimReturnValue       - the new data value. Return with original value to leave unmodified.

Data Cleaning Script Examples

The following script is both a testing and cleaning script to test for all lowercase letters, and to lowercase them if not.
   aimReturnStatus = true;
   aimReturnValue = aimValue;

   if (aimValue != null)
   {
      lower = aimValue.toLowerCase();
      if ( !aimValue.equals(lower) )
      {
         aimReturnStatus = false;
         aimReturnValue = lower;
      }
   }
When applied as a test script, the following script flags strings longer than 16 characters as failed. When applied as a cleaning script, it truncates the string and replaces the last 3 characters with "...".
   limit = 16;     

   aimReturnStatus = true;
   aimReturnValue = aimValue;
    
   if (aimValue != null) 
   {
      len = aimValue.length;
      if ( len > limit)
      {
         aimReturnStatus = false;
         aimReturnValue = aimValue.substring(0,limit-3) + "...";
      }
   }

Managing Data Cleaning Scripts

A data cleaning script is created, edited, copied, or deleted in the same manner as a testing script. The only difference that turns a testing script into a cleaning script is setting the "aimReturnValue" variable. For example, this script sets a default value for any empty cells it encounters:

To see what values the cleaning script modifies, you can simply apply it with the "Apply Metric to column" option in the right-click menu. To clean the data, you can apply it as a cleaning script with the "Apply Cleaner Script to column" option.

Applying a Cleaning Script


Populating or Deleting values

For an empty cell, the cleaning script is called with "aimValue" = null. You can set a default value or construct one from other attributes set it in "aimReturnValue".
   aimReturnStatus = true;
   aimReturnValue = aimValue;
    
   if (aimValue == null) 
   {
      aimReturnValue = "Some Default Value";
   }
   else
   {  //...
   }

Similarly, a script can delete a value by setting "aimReturnValue" to null. Scripts should always initialize "aimReturnValue" to "aimValue" so it does not appear to Analyzer as a null.