# Article

article

Score:

## Matching Users Between Systems

### Author Info

13 November 2008 - 7:27am
Submitted by: rkalfane

(View Disclaimer)

## When you can't rely on common unique IDs...

### Introduction

It happens quite often during projects that you need to match users from one system (for instance the Meta-Directory) to another system (for instance a HR System), but you can't use any common unique identifiers between the two systems! Maybe in the HR System, you will only find first name and last name, and no login names...

This little tip may help you find matching users based on their first name and last name. The tool is calculating "distances" between first names and last names in the different systems. Based on the score, you will find the exact matches (same first name, same last name, distance = 0) and very close matches (maybe one letter or two inverted characters difference).

### Method used

The method is using the Levenshtein algorithm to calculate the distance between two strings. This distance is equal to the minimum characters to add, insert or replace to go from one string to the other one.

For instance, the distance between two identical stings is 0 (obvious).

The distance between "Henry" and "Henri" is 1 as you need to replace "y" by "i".

The distance between "Marianne" and "Marian" is 2 as you need to delete two letters.

Here is the algorithm in pseudo-code:

```int LevenshteinDistance(char s[1..m], char t[1..n])
// d is a table with m+1 rows and n+1 columns
declare int d[0..m, 0..n]

for i from 0 to m
d[i, 0] := i
for j from 0 to n
d[0, j] := j

for i from 1 to m
for j from 1 to n
{
if s[i] = t[j] then cost := 0
else cost := 1
d[i, j] := minimum(
d[i-1, j] + 1,     // deletion
d[i, j-1] + 1,     // insertion
d[i-1, j-1] + cost   // substitution
)
}

return d[m, n]

```

### Java code

Based on the pseudo code, here is the Java implementation of the distance calculation:

```    private static int minimum(int a, int b, int c){
if (a<=b && a<=c)
return a;
if (b<=a && b<=c)
return b;
return c;
}

public static int computeLevenshteinDistance(String aStr1, String aStr2) {

char[] str1 = aStr1.toCharArray();
char[] str2 = aStr2.toCharArray();

int[][] distance = new int[str1.length+1][];

for(int i=0; i<=str1.length; i++){
distance[i] = new int[str2.length+1];
distance[i][0] = i;
}
for(int j=0; j<str2.length+1; j++)
distance[0][j]=j;

for(int i=1; i<=str1.length; i++)
for(int j=1;j<=str2.length; j++)
distance[i][j]= minimum(distance[i-1][j]+1, distance[i][j-1]+1,
distance[i-1][j-1]+((str1[i-1]==str2[j-1])?0:1));

return distance[str1.length][str2.length];
}

```

The tool that has been built, based on this calculation, takes 3 arguments in the command-line:

• first name, last name, unique ID export from the first system
• first name, last name, unique ID export from the second system
• a CSV result file to store the matches along with scores calculated

Here is the full source code of the tool explained (not that big):

First, the different imports to be able to read files, write files, manipulate the data:

```import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.TreeSet;

public class Distance
{

```

As we have seen, here are the Levenshtein distance calculation methods:

```    private static int minimum(int a, int b, int c){
if (a<=b && a<=c)
return a;
if (b<=a && b<=c)
return b;
return c;
}

public static int computeLevenshteinDistance(String aStr1, String aStr2) {

char[] str1 = aStr1.toCharArray();
char[] str2 = aStr2.toCharArray();

int[][] distance = new int[str1.length+1][];

for(int i=0; i<=str1.length; i++){
distance[i] = new int[str2.length+1];
distance[i][0] = i;
}
for(int j=0; j<str2.length+1; j++)
distance[0][j]=j;

for(int i=1; i<=str1.length; i++)
for(int j=1;j<=str2.length; j++)
distance[i][j]= minimum(distance[i-1][j]+1, distance[i][j-1]+1,
distance[i-1][j-1]+((str1[i-1]==str2[j-1])?0:1));

return distance[str1.length][str2.length];
}

```

In the main class, we load the two CSV files ("," is the separator used here) and load them in TreeSet objects, so they are ordered by last name:

```	public static void main( String[] args ) throws Exception
{
PrintWriter oOutResult = new PrintWriter(new BufferedWriter(new FileWriter(args[2])));

String oLine = "";

TreeSet oSys1Set = new TreeSet();
while( ( oLine = oInSys1.readLine() ) != null )
{
String[] oTokens = oLine.split( "," );
String oSn = oTokens[0].toUpperCase();
String oGivenName = oTokens[1].toUpperCase();
String oSys1ID = oTokens[2];
String oSys1Data = oSn + "~" + oGivenName + "~" + oSys1ID;
}

TreeSet oSys2Set = new TreeSet();
while( ( oLine = oInSys2.readLine() ) != null )
{
String[] oTokens = oLine.split(",");
String oSn = oTokens[0].toUpperCase();
String oGivenName = oTokens[1].toUpperCase();
String oUniqueID = oTokens[2];
String oSys2Data = oSn + "~" + oGivenName + "~" + oUniqueID;
}

```

Now we can cycle through all entries in the second system and for each entry, cycle in all the entries of the second system to calculate distances and find the best match:

```
// Cycle through Sys2 data and find closest Sys1 string
Iterator oSys2Iterator = oSys2Set.iterator();
while( oSys2Iterator.hasNext() )
{
String oSys2Data = (String) oSys2Iterator.next();
String[] oSys2Tokens = oSys2Data.split("~");
String oSys2Sn = oSys2Tokens[0];
String oSys2GivenName = oSys2Tokens[1];
String oSys2UniqueID = oSys2Tokens[2];

String oBestMatch = "";
String oBestSn = "";
String oBestGivenName = "";
double oBestScore = 999999;

// Find closest Sys1 string
Iterator oSys1Iterator = oSys1Set.iterator();
while( oSys1Iterator.hasNext() )
{
String oSys1Data = (String) oSys1Iterator.next();
String[] oSys1Tokens = oSys1Data.split("~");
String oSys1Sn = oSys1Tokens[0];
String oSys1GivenName = oSys1Tokens[1];
String oSys1ID = oSys1Tokens[2];

```

The calculation is using the distance between last names and between first names, giving more importance to the last name. You can experiment by changing a bit the way the final score is calculated...

```				 double oScore = java.lang.Math.sqrt(
java.lang.Math.pow( Distance.computeLevenshteinDistance( oSys2Sn , oSys1Sn ) * 2, 2 )
+ java.lang.Math.pow( Distance.computeLevenshteinDistance( oSys2GivenName , oSys1GivenName ), 2 ) );

if ( oScore < oBestScore )
{
oBestScore = oScore;
oBestMatch = oSys1ID;
oBestSn = oSys1Sn;
oBestGivenName = oSys1GivenName;
}

```

If an exact match is found, we remove the entry from the first set of data:

```
if ( oScore == 0 )
{
oSys1Set.remove( oSys1Data );
break;
}
}

```

Finally, we save the best match to the result file:

```
String oResultData = oSys2Sn + "," + oSys2GivenName + "," + oSys2UniqueID + "," + oBestSn + "," + oBestGivenName + "," + oBestMatch + "," + oBestScore;
oOutResult.write( oResultData + "\n" );
}

oOutResult.close();
oInSys2.close();
oInSys1.close();
}
}

```

We can compile the source code of the tool using the following command:

`javac Distance.java `

The result is a Distance.class file we can package in a jar file using the following command:

`jar cvf distance.jar Distance.class `

### Testing the tool

Let's test the tool by processing two CSV files with exact matches and similar matches.

The first CSV file sys1.csv corresponding to the first system is the following:

```William,Abdo,wabdo
Eddie,Austin,eaustin
Jerry,Dunn,jdunn
Deborah,Fair,dfair
Marianne,Griffith,mgriffith
Jeff,Healey,jhealey
Clarence,Holloway,chooloway
Clifford,Hough,chough
Randy,Krumins,rkrumins
Larry,Larsen,llarsen
Sean,Murphy,smurphy
Marie,Saenz,msaenz

```

The second CSV file sys2.csv corresponding to the second system is the following:

```Willy,Abdo,10001
Eddy,Austin,10002
Jeremy,Dunn,10003
Deborah,Fair,10004
Marian,Griffith,10005
Jeff,Healy,10006
Clarence,Halloway,10007
Clifford,Hough,10008
Randy,Krumins,10009
Larry,Larsen,10010
Sean,Murphy,10011
Mary,Saenz,10012

```

We can run the tool by executing the following command:

`java -classpath distance.jar Distance sys1.csv sys2.csv result.csv `

Here is the resulting CSV file:

```CLARENCE,HALLOWAY,10007,CLARENCE,HOLLOWAY,chooloway,1.0
CLIFFORD,HOUGH,10008,CLIFFORD,HOUGH,chough,0.0
DEBORAH,FAIR,10004,DEBORAH,FAIR,dfair,0.0
EDDY,AUSTIN,10002,EDDIE,AUSTIN,eaustin,4.0
JEFF,HEALY,10006,JEFF,HEALEY,jhealey,1.0
JEREMY,DUNN,10003,JERRY,DUNN,jdunn,4.0
LARRY,LARSEN,10010,LARRY,LARSEN,llarsen,0.0
MARIAN,GRIFFITH,10005,MARIANNE,GRIFFITH,mgriffith,4.0
MARY,SAENZ,10012,MARIE,SAENZ,msaenz,4.0
RANDY,KRUMINS,10009,RANDY,KRUMINS,rkrumins,0.0
SEAN,MURPHY,10011,SEAN,MURPHY,smurphy,0.0
WILLY,ABDO,10001,WILLIAM,ABDO,wabdo,6.0

```

We can see entries from the second systems with possible matches in the first system and the score calculated based on the distance between names. If you order by scores, you will first see the exact matches and then names that are very similar (scores 1 or 2).

Using this technique on a few thousands of entries can really help in the data cleaning, matching process, before putting in production a new driver.

BilagaStorlek
distance.zip8.08 kB

Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).

It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.

## Well done!

Submitted by dvandermaas on 23 November 2008 - 2:50pm.

As always, a great tool which will be very useful.
It might be an idea to implement this in the Analyzer product, for it is not in there yet, to my knowledge.
Grtz
David

## Matching Users between systems

Submitted by Shivaji on 24 November 2008 - 8:21am.

Thanks a lot !!!

This is a problem I have to deal with regularly when we need to take lists from different systems and match, say IDs on one system to logins on a third system. Just spent an hour on doing this manually from data dumps from our eProcurement system and our ERP system, and I was getting ready to work on a more permanent solution. You just saved me a few hours of programming,

Thanks again,

Shivaji

## Thanks!

Submitted by rkalfane on 25 November 2008 - 10:03am.

Hello,

Thanks for the feedback, I always appreciate! Of course this can be improved, but could be a good base for special matching algorithm. I remember that I also tested another way where all vowels are removed ("Reza Kalfane" becomes "Rz Klfn"), and depending on the language, this can also help a bit in the matching process...

Cheers,

Réza