3.9 Cleaning Data

Analyzer can clean bad data as well as test for the bad data. Data cleaning is based on the scripting capabilities of Script Metrics.

The same script is used for either testing or cleaning data. A script used for testing returns the Boolean variable aimReturnStatus as either True or False depending on whether the data passed the test. The same script returns a string variable aimReturnValue containing the cleaned data. When applied as a testing script, Analyzer looks at aimReturnStatus. When applied as a cleaning script, Analyzer looks at aimReturnStatus and, if different, modifies the data value. aimReturnValue is returned with the original value (aimValue) if no data modification is to take place. The modified value appears in the Data Browser highlighted in yellow with a pencil icon, just as if it had been modified individually.

3.9.1 Data Cleaning Variables

The input variables are the same as for Script Metrics, with the addition of aimReturnValue.

   aimValue             - the current value being tested
   aimArrayValue        - array of all values in the current cell
   aim_<attrName>       - the value of each attribute in the dataset
   aimArray_<attrName>  - array of values for each attribute

   aimReturnStatus      - boolean variable.  Set to true if value passed test, false if failed.
   aimReturnValue       - the new data value. Return with original value to leave unmodified.

3.9.2 Data Cleaning Script Examples

The following script is both a testing and cleaning script to test for all lowercase letters, and to lowercase them if they are not already lowercase.

   aimReturnStatus = true;
   aimReturnValue = aimValue;

   if (aimValue != null)
   {
      lower = aimValue.toLowerCase();
      if ( !aimValue.equals(lower) )
      {
         aimReturnStatus = false;
         aimReturnValue = lower;
      }
   }

When applied as a test script, the following script flags strings longer than 16 characters as failed. When applied as a cleaning script, it truncates the string and replaces the last 3 characters with “...”.

   limit = 16;     

   aimReturnStatus = true;
   aimReturnValue = aimValue;
    
   if (aimValue != null) 
   {
      len = aimValue.length;
      if ( len > limit)
      {
         aimReturnStatus = false;
         aimReturnValue = aimValue.substring(0,limit-3) + "...";
      }
   }

3.9.3 Managing Data Cleaning Scripts

A data cleaning script is created, edited, copied, or deleted in the same manner as a testing script. The only difference that turns a testing script into a cleaning script is setting the aimReturnValue variable.

To see what values the cleaning script modifies, you can simply apply it with the Apply Metric to column option in the right-click menu. To clean the data, you can apply it as a cleaning script with the Apply Cleaner Script to column option.

3.9.4 Populating or Deleting Values

For an empty cell, the cleaning script is called with aimValue = null. You can set a default value or construct one from other attributes set it in aimReturnValue.

   aimReturnStatus = true;
   aimReturnValue = aimValue;
    
   if (aimValue == null) 
   {
      aimReturnValue = "Some Default Value";
   }
   else
   {  //...
   }

Similarly, a script can delete a value by setting aimReturnValue to null. Scripts should always initialize aimReturnValue to aimValue so it does not appear to Analyzer as a null.