Extracting LinkedIn Connections Example
Table of Contents
Automatic HTML form submission
Introduction
Content of the attached archive
Using the script
Technical details
Conclusion
Introduction
This article shows how to automate HTTP actions such as logging in to a website and retrieving content from different pages. As an example, we will connect to the LinkedIn website, log in using your credentials (if you have an account there), and automatically retrieve your connections or someone else's connections.

The goal is to show how to script automatic form submission and HTML content retrieval. This can be very useful if, for instance, you want to automatically register a user on a website (provisioning) when no API is available, or to run automated tests against a web interface.

I took LinkedIn as an example but it seems that later this year, there will be a LinkedIn API available to developers to connect to the site, do searches, retrieve profiles, connections, etc. This could be a pretty useful and dangerous tool...
Content of the attached archive
Here is the content of the file LinkedIn.zip:
./LinkedIn
|__ get_linkedin_connections
\__ docs
    |__ images_linkedin
    |   \__ *.png
    |__ linkedin.txt
    \__ linkedin.html
Details:
get_linkedin_connections: Main script retrieving your connections or someone else's connections, if a key is provided. You can display or save the result using TXT, CSV or XML format.
docs/linkedin.txt: Wiki source of this article
docs/linkedin.html: Result of the conversion from Wiki to HTML (see Wiki to CoolSolutions Converter)
docs/images_linkedin/*: All the pictures used in this article
Using the script
You can call the script by specifying credentials and user to check from the command-line. You can get the list of options anytime using the -h or --help option:
/LinkedIn> ./get_linkedin_connections -h
usage: get_linkedin_connections [options] [output.ext]
retrieve LinkedIn connections and export in different formats
possible formats: TXT, CSV, XML
-h or --help for help
example: get_linkedin_connections -D me@domain.com -w mypass
get_linkedin_connections -D me@domain.com -W
get_linkedin_connections -D me@domain.com -w mypass -k 1234567 -o csv
get_linkedin_connections -D me@domain.com -w mypass -o xml output.xml
options:
--version show program's version number and exit
-h, --help show this help message and exit
-c, --changelog display changelog
-D USER user name (email)
-w PASSWD password
-W prompt for password
-k KEY key of the user to check (logged user by default)
-o OUTTYPE output type: txt, csv or xml [default: txt]
To retrieve your own connections you can use the following commands:
get_linkedin_connections -D me@domain.com -w mypass
or
get_linkedin_connections -D me@domain.com -W
The result will have the following format:
MyFirstName MyLastName's Connections (key=0123456)
MyLongTitle

My Connection1 (key=1234567)
Title1

My Connection2 (key=1234568)
Title2

My Connection3 (key=1234569)
Title3

My Connection4 (key=1234570)
Title4

My Connection5 (key=1234571)
Title5

...
...
To retrieve someone else's connections, you need to specify the key corresponding to that user (when listing your connections with the above command, you will see the keys corresponding to each user):
get_linkedin_connections -D me@domain.com -w mypass -k 1234567
The result will have the following format:
My Connection1's Connections (key=1234567)

MyFirstName MyLastName (key=0123456)
MyLongTitle

My Connection4 (key=1234570)
Title4

My Connection5 (key=1234571)
Title5

My Connection6 (key=1234572)
Title6

My Connection7 (key=1234573)
Title7

...
...
To export the result to a different format, you can use the following commands. To export as CSV:
get_linkedin_connections -D me@domain.com -w mypass -o csv
The result will look like the following:
# MyFirstName MyLastName's Connections (key=0123456)
# MyLongTitle
# 20 connections
"key";"name";"title"
"1234567";"My Connection1";"Title1"
"1234568";"My Connection2";"Title2"
"1234569";"My Connection3";"Title3"
"1234570";"My Connection4";"Title4"
"1234571";"My Connection5";"Title5"
...
...
To export as XML:
get_linkedin_connections -D me@domain.com -w mypass -o xml
In that case, the result will look like the following:
<?xml version="1.0" encoding="utf8"?>
<profile id="0123456">
<name>MyFirstName MyLastName</name>
<title>MyLongTitle</title>
<connections count="20">
<profile id="1234567">
<name>My Connection1</name>
<title>Title1</title>
</profile>
<profile id="1234568">
<name>My Connection2</name>
<title>Title2</title>
</profile>
<profile id="1234569">
<name>My Connection3</name>
<title>Title3</title>
</profile>
<profile id="1234570">
<name>My Connection4</name>
<title>Title4</title>
</profile>
<profile id="1234571">
<name>My Connection5</name>
<title>Title5</title>
</profile>
...
...
</connections>
</profile>
To save the result to an output file, you can use the following commands:
get_linkedin_connections -D me@domain.com -w mypass -o txt output.txt
or
get_linkedin_connections -D me@domain.com -w mypass -o csv output.csv
or
get_linkedin_connections -D me@domain.com -w mypass -o xml output.xml
Technical details
This section explains the different parts of the script. The global behavior is to log into LinkedIn, go to the Connections page, get the user's information and all connections on multiple pages, if applicable. All connections are stored in a dictionary before being processed to generate the output.
1. The first part specifies the modules to use in the script. The httplib and urllib modules are used to build HTTP URLs, connect to web pages, submit a form, and retrieve the HTML result. The codecs module is only used to write UTF-8 files, as LinkedIn uses Unicode characters in names and titles.
#!/usr/bin/python
import getpass, httplib, urllib, codecs, sys, re
from htmlentitydefs import name2codepoint as n2cp
from optparse import OptionParser
2. The second part handles command-line arguments and options using the OptionParser class. To use the script, you just need the LinkedIn credentials of the user, an optional key if you want to check someone else's connections, and an optional output format if you want to save the result in a file (TXT, CSV or XML formats):
changelog = [ "02/03/2008 - v0.1 - retrieve LinkedIn connections" ]
usage = """%prog [options] [output.ext]
retrieve LinkedIn connections and export in different formats
possible formats: TXT, CSV, XML
-h or --help for help
example: %prog -D me@domain.com -w mypass
%prog -D me@domain.com -W
%prog -D me@domain.com -w mypass -k 1234567 -o csv
%prog -D me@domain.com -w mypass -o xml output.xml"""
# Handle command-line options and arguments
parser = OptionParser(usage=usage, version="%prog - 02/03/2008 - v0.1 - Reza Kalfane")
parser.add_option( "-c", "--changelog", action="store_true", dest="changelog", help="display changelog" )
parser.add_option( "-D", action="store", type="string", metavar="USER", dest="user", help="user name (email)" )
parser.add_option( "-w", action="store", type="string", metavar="PASSWD", dest="passwd", help="password" )
parser.add_option( "-W", action="store_true", dest="passwd_i", help="prompt for password" )
parser.add_option( "-k", action="store", metavar="KEY", dest="key", help="key of the user to check (logged user by default)" )
parser.add_option( "-o", action="store", type="choice", metavar="OUTTYPE", dest="out_type", default="txt", help="output type: txt, csv or xml [default: %default]", choices=["txt","csv","xml"] )
(options, args) = parser.parse_args()
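For readers on Python 3, where optparse is deprecated, the same command-line interface can be sketched with argparse. This is only an illustration under that assumption, not the script shipped in the archive; the build_parser name and the positional "output" argument are hypothetical.

```python
import argparse

def build_parser():
    """Build a parser mirroring the options described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="get_linkedin_connections",
        description="retrieve LinkedIn connections and export in different formats",
    )
    parser.add_argument("-D", dest="user", metavar="USER", help="user name (email)")
    parser.add_argument("-w", dest="passwd", metavar="PASSWD", help="password")
    parser.add_argument("-W", dest="passwd_i", action="store_true",
                        help="prompt for password")
    parser.add_argument("-k", dest="key", metavar="KEY",
                        help="key of the user to check (logged user by default)")
    parser.add_argument("-o", dest="out_type", metavar="OUTTYPE",
                        choices=["txt", "csv", "xml"], default="txt",
                        help="output type: txt, csv or xml (default: %(default)s)")
    # Optional output file, equivalent to args[0] in the optparse version
    parser.add_argument("output", nargs="?", help="optional output file")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch is easy to try
opts = build_parser().parse_args(
    ["-D", "me@domain.com", "-w", "mypass", "-o", "csv", "output.csv"]
)
print(opts.user, opts.out_type, opts.output)
```

With argparse, the optional output file becomes a named attribute instead of an entry in a separate args list.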
3. Once the arguments and the options are parsed from the command-line, you can check that everything is valid, display the changelog if requested, and prompt for the password if needed.
# Display changelog
if options.changelog:
    print "\n".join( changelog )
    sys.exit()
# Prompt for password
if options.passwd_i:
    options.passwd = getpass.getpass()
# Options verifications
if options.user == None or options.passwd == None:
    parser.error( "please specify credentials" )
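4. The script uses two helper functions, adapted from code found on the web, to convert HTML entities to full Unicode strings:
# Transform HTML entities
def substitute_entity(match):
    ent = match.group(2)
    if match.group(1) == "#":
        # Numeric entity: decimal (&#233;) or hexadecimal (&#xE9;)
        try:
            if ent[0] in "xX":
                return unichr(int(ent[1:], 16))
            return unichr(int(ent))
        except ValueError:
            # Not a valid character reference: leave the text untouched
            return match.group()
    else:
        cp = n2cp.get(ent)
        if cp:
            return unichr(cp)
        else:
            return match.group()
def decode_htmlentities(string):
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]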
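On Python 3, the standard library already covers this conversion with html.unescape, so a hand-written helper is not strictly needed. A minimal sketch, assuming Python 3 (the decode_entities wrapper name is hypothetical):

```python
import html

def decode_entities(text):
    """Convert named and numeric HTML entities to Unicode characters."""
    return html.unescape(text)

# Named, decimal and hexadecimal references all decode to the same character
print(decode_entities("Ren&eacute; / Ren&#233; / Ren&#xE9; &amp; Co"))
```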
5. From there, you need to simulate a login to the LinkedIn website. The main page looks like the following:

Here is the part which is of interest for us - the login form:

Let's look at the source code of the page to see which field names to use:
<form action="https://www.linkedin.com/secure/login" method="post" accept-charset="UTF-8" name="login">
<table>
<tbody><tr>
<td colspan="3" class="reason" name="reason"></td>
</tr>
<tr>
<td align="right" width="30%"><label for="session_key-login">Email address:</label></td>
<td colspan="2" width="70%"><input name="session_key" value="" id="session_key-login" size="24" type="text"></td>
</tr>
<tr>
<td align="right"><label for="session_password-login">Password:</label></td>
<td colspan="2"><input name="session_password" value="" id="session_password-login" size="24" type="password"></td>
</tr>
<tr valign="top">
<td> </td>
<td><input name="session_login" value="Sign In" class="btn-primary" type="submit"></td>
<td width="20"><a href="http://www.linkedin.com/passwordReset" name="forgotPassword" class="forgotpwd">Forgot password?</a></td>
</tr>
</tbody></table>
<div style="display: none;" id="cookieDisabled">Make sure you have cookies and Javascript enabled in your browser before signing in.</div>
<script type="text/javascript">
if (navigator.cookieEnabled == true) {
if(document.getElementById('cookieDisabled')) document.getElementById('cookieDisabled').style.display = 'none';
}
</script>
<input name="session_login" value="" id="session_login-login" type="hidden"><input name="session_rikey" value="" id="session_rikey-login" type="hidden">
</form>
In the LinkedIn login form, here are the needed fields:
session_key: handles the user name
session_password: holds the password of the user
session_login: sent twice, once empty and once with the value "Sign In"
session_rikey: empty here
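The fields listed above can be turned into a form body as follows. This is a sketch assuming Python 3's urllib.parse (the article's script uses the Python 2 urllib module), and build_login_body is a hypothetical helper name. Passing a list of pairs rather than a dictionary allows session_login to appear twice, and every value, including the password, gets URL-encoded.

```python
from urllib.parse import urlencode

def build_login_body(user, password):
    """Encode the login form fields as application/x-www-form-urlencoded."""
    fields = [
        ("session_key", user),
        ("session_password", password),
        ("session_login", "Sign In"),  # value of the submit button
        ("session_login", ""),         # hidden field with the same name
        ("session_rikey", ""),
    ]
    return urlencode(fields)

# A password containing "&" or spaces must not break the form body
body = build_login_body("me@domain.com", "my&pass word")
print(body)
```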
Using the HTTPSConnection class from the httplib module, you can connect to https://www.linkedin.com/secure/login, fill in the form using the user name and password from the options, submit the form, and get the authentication cookie back from the response. The cookie contains multiple values, including a session ID and information about the logged user, such as the LinkedIn key. You need to store that cookie to use it for later HTTP connections.
# Login
conn = httplib.HTTPSConnection( "www.linkedin.com:443" )
headers = {'Content-type': 'application/x-www-form-urlencoded', 'Accept': 'text/plain'}
# URL-encode every field, including the password; session_login is sent twice, as in the form
params = urllib.urlencode( [ ('session_key', options.user), ('session_password', options.passwd), ('session_login', 'Sign In'), ('session_login', ''), ('session_rikey', '') ] )
conn.request( "POST", "/secure/login", params, headers )
response = conn.getresponse()
# getheader() returns None when the response sets no cookie
cookie = response.getheader( "set-cookie" ) or ""
mykey = None
match = re.match( "^.*leo_auth_token=LIM:(.*?):.*$", cookie )
if not match:
    print "Could not log into LinkedIn!"
    sys.exit()
mykey = match.group(1)
if options.key != None:
    mykey = options.key
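The key extraction can be isolated in a small function and tried without any network access. The sketch below assumes Python 3; the extract_key name and the sample Set-Cookie value are hypothetical and only follow the leo_auth_token=LIM:<key>: pattern that the script looks for.

```python
import re

def extract_key(set_cookie):
    """Return the LinkedIn key found in a Set-Cookie header, or None."""
    if not set_cookie:
        # No cookie at all: the login did not succeed
        return None
    match = re.search(r"leo_auth_token=LIM:(.*?):", set_cookie)
    return match.group(1) if match else None

# Made-up header value, for illustration only
sample_cookie = 'JSESSIONID="ajax:42"; Path=/, leo_auth_token=LIM:0123456:a:1200000000:abcdef; Path=/'
print(extract_key(sample_cookie))
```

Returning None for a missing or unexpected cookie lets the caller report a failed login instead of crashing on the regular expression.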
6. Once logged into the LinkedIn website, you can connect to the regular http://www.linkedin.com site and retrieve the connections page of the selected profile. This is either the logged user or another user when a key is specified in the options:

# Get connections
result = ""
headers["Cookie"]=cookie
conn = httplib.HTTPConnection( "www.linkedin.com:80" )
conn.request("GET","/profile?viewConns=&key=" + mykey + "&split_page=1","",headers)
response = conn.getresponse()
htmlresult = response.read()
7. The connections page contains the full name of the user, their title, and the list of connections, possibly spread over multiple pages. You can go through the contents of this first page to find out how many connections pages the user has.
# Retrieve user name, title and max connections pages
# from first page
givenname = "?"
familyname = "?"
title = "?"
title_in_next_line = False
splitpage = 1
for line in htmlresult.split( "\n" ):
    match1 = re.match( '^.*<span class="given-name">(.*?)</span>.*', line )
    match2 = re.match( '^.*<span class="family-name">(.*?)</span>.*', line )
    match3 = re.match( '^.*split_page=([0-9]+).*', line )
    match4 = re.match( '^.*<p class="title">.*', line )
    # Given name found
    if match1:
        givenname = match1.group(1)
    # Family name found
    if match2:
        familyname = match2.group(1)
    # Pages count found
    if match3:
        # Compare page numbers as integers: as strings, "9" would rank above "10"
        maxpage = max( [ int( page ) for page in re.findall( "split_page=([0-9]+)", line ) ] )
        if maxpage > splitpage:
            splitpage = maxpage
    # Line contains title
    if title_in_next_line:
        match5 = re.match( '^\s*(.*)', line )
        if match5:
            title = match5.group(1)
        title_in_next_line = False
    # Next line contains title
    if match4:
        title_in_next_line = True
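One detail worth illustrating is the page count. Page numbers captured by a regular expression are strings, so they should be converted to integers before taking the maximum. A small sketch, assuming Python 3; the max_split_page name and the sample pager markup are hypothetical:

```python
import re

def max_split_page(html_text):
    """Return the highest split_page number referenced in the page (1 if none)."""
    pages = [int(p) for p in re.findall(r"split_page=([0-9]+)", html_text)]
    return max(pages, default=1)

# Made-up pager links, for illustration only
sample_pager = (
    '<a href="/profile?viewConns=&key=1&split_page=2">2</a> '
    '<a href="/profile?viewConns=&key=1&split_page=9">9</a> '
    '<a href="/profile?viewConns=&key=1&split_page=10">10</a>'
)
print(max_split_page(sample_pager))
```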
8. If there are multiple pages, the script can navigate through them using the split_page parameter in the URL to retrieve all the HTML pages containing connections.
# Get connections from additional pages
if splitpage > 1:
    for i in range( 2, splitpage + 1 ):
        conn.request("GET","/profile?viewConns=&key=" + mykey + "&split_page=" + str( i ),"",headers)
        response = conn.getresponse()
        htmlresult += response.read()
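The request paths used for pagination can be generated separately from the HTTP calls, which makes them easy to check. A minimal sketch, assuming Python 3; connection_page_paths is a hypothetical helper and the query parameters are the ones used by the script above:

```python
from urllib.parse import urlencode

def connection_page_paths(key, last_page):
    """Yield the request path of each connections page, from 1 to last_page."""
    for page in range(1, last_page + 1):
        query = urlencode([("viewConns", ""), ("key", key), ("split_page", page)])
        yield "/profile?" + query

paths = list(connection_page_paths("1234567", 3))
for path in paths:
    print(path)
```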
9. Now that you have all the pages of contents, you can cycle through each line of the result to extract the key, name and title and store everything in a dictionary. The key of that dictionary is a tuple based on the full name in uppercase and the unique key, so that sorting the dictionary keys sorts the result by name:
# Build connections dictionary
connections = {}
current_key = ""
current_name = ""
current_title = ""
for line in htmlresult.split( "\n" ):
    match1 = re.match( '^.*<span name="connection"><a href=".*?key=(.*?)&.*?">(.*?)</a></span>.*$', line )
    match2 = re.match( '^.*<span name="headline" class="headline">(.*?)</span>.*$', line )
    if match1:
        current_key = match1.group(1)
        current_name = decode_htmlentities( match1.group(2) )
    if match2:
        current_title = decode_htmlentities( match2.group(1) )
        connections[ ( current_name.upper(), current_key ) ] = {}
        connections[ ( current_name.upper(), current_key ) ][ "name" ] = current_name
        connections[ ( current_name.upper(), current_key ) ][ "title" ] = current_title
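The same parsing logic can be tried on a small HTML fragment without logging in. The sketch below assumes Python 3; parse_connections is a hypothetical name and the sample markup is made up to match the two patterns used by the script, so it may not reflect what LinkedIn actually serves.

```python
import html
import re

CONNECTION_RE = re.compile(
    r'<span name="connection"><a href=".*?key=(.*?)&.*?">(.*?)</a></span>'
)
HEADLINE_RE = re.compile(r'<span name="headline" class="headline">(.*?)</span>')

def parse_connections(html_text):
    """Map (NAME IN UPPERCASE, key) to a dict holding the name and title."""
    connections = {}
    key = name = None
    for line in html_text.splitlines():
        found = CONNECTION_RE.search(line)
        if found:
            key, name = found.group(1), html.unescape(found.group(2))
        found = HEADLINE_RE.search(line)
        if found and key is not None:
            # The headline closes the entry started by the connection line
            connections[(name.upper(), key)] = {
                "name": name,
                "title": html.unescape(found.group(1)),
            }
            key = name = None
    return connections

# Made-up fragment, for illustration only
sample_html = """
<span name="connection"><a href="/profile?viewProfile=&key=1234568&x=1">Zo&euml; Smith</a></span>
<span name="headline" class="headline">Title2</span>
<span name="connection"><a href="/profile?viewProfile=&key=1234567&x=1">Al Jones</a></span>
<span name="headline" class="headline">R&amp;D Lead</span>
"""
conns = parse_connections(sample_html)
for name_key in sorted(conns):
    print(name_key, conns[name_key]["title"])
```

Because the tuple starts with the uppercase name, sorted() returns the connections in alphabetical order regardless of the order in which they appear in the page.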
10. Cycle through the resulting dictionary to export the result. Here is the code used to export as text content:
# Output
output = ""
# txt
if options.out_type == "txt":
    output += givenname + " " + familyname + "'s Connections\n"
    output += title + "\n\n"
    for ( name, key ) in sorted( connections.keys() ):
        output += connections[ ( name, key ) ][ "name" ] + " (key=" + key + ")\n"
        output += connections[ ( name, key ) ][ "title" ] + "\n\n"
    output += str( len( connections ) ) + " connection" + "s"*( len( connections ) > 1 )
Here is the code used to export as CSV content. The first three lines are comments about the user (name, key, title and number of connections):
# csv
elif options.out_type == "csv":
    output += "# " + givenname + " " + familyname + "'s Connections\n"
    output += "# " + title + "\n"
    output += "# " + str( len( connections ) ) + " connection" + "s"*( len( connections ) > 1 ) + "\n"
    output += '"key";"name";"title"\n'
    for ( name, key ) in sorted( connections.keys() ):
        output += '"%s";"%s";"%s"\n' % ( key, connections[ ( name, key ) ][ "name" ], connections[ ( name, key ) ][ "title" ] )
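String formatting works as long as names and titles contain no double quotes. As an alternative, the standard csv module takes care of quoting. This is a sketch assuming Python 3, not the script's own method; connections_to_csv is a hypothetical helper and the sample entry is invented.

```python
import csv
import io

def connections_to_csv(connections):
    """Render the connections dictionary as semicolon-separated, fully quoted CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_ALL,
                        lineterminator="\n")
    writer.writerow(["key", "name", "title"])
    for name_key in sorted(connections):
        info = connections[name_key]
        writer.writerow([name_key[1], info["name"], info["title"]])
    return buffer.getvalue()

# Invented entry whose title contains both the delimiter and double quotes
sample_connections = {
    ("AL JONES", "1234567"): {"name": "Al Jones", "title": 'Lead; "R&D"'},
}
csv_text = connections_to_csv(sample_connections)
print(csv_text)
```

The writer doubles any embedded double quote, so a title containing quotes or semicolons still yields a row that spreadsheet tools can parse.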
Here is the code used to export the result as an XML document:
# xml
elif options.out_type == "xml":
    output += '<?xml version="1.0" encoding="utf8"?>\n'
    output += '<profile id="%s">\n' %mykey
    output += '\t<name>%s %s</name>\n' % ( givenname, familyname )
    output += '\t<title>%s</title>\n' % title
    output += '\t<connections count="%s">\n' % len( connections )
    for ( name, key ) in sorted( connections.keys() ):
        output += '\t\t<profile id="%s">\n' % key
        output += '\t\t\t<name>%s</name>\n' % connections[ ( name, key ) ][ "name" ]
        output += '\t\t\t<title>%s</title>\n' % connections[ ( name, key ) ][ "title" ]
        output += '\t\t</profile>\n'
    output += '\t</connections>\n'
    output += "</profile>"
    # Escape ampersands so that the XML document stays well-formed
    output = re.sub( "&", "&amp;", output )
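Building the XML by hand requires escaping special characters manually. A possible alternative, sketched here for Python 3 with the standard xml.etree.ElementTree module, lets the library escape &, < and > in text and attributes. The connections_to_xml name and the sample data are hypothetical; the element names follow the output format shown earlier.

```python
import xml.etree.ElementTree as ET

def connections_to_xml(profile_id, full_name, title, connections):
    """Serialize the profile and its connections as an XML string."""
    root = ET.Element("profile", id=profile_id)
    ET.SubElement(root, "name").text = full_name
    ET.SubElement(root, "title").text = title
    conns = ET.SubElement(root, "connections", count=str(len(connections)))
    for name_key in sorted(connections):
        info = connections[name_key]
        profile = ET.SubElement(conns, "profile", id=name_key[1])
        ET.SubElement(profile, "name").text = info["name"]
        ET.SubElement(profile, "title").text = info["title"]
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")

# Invented data with characters that need escaping
sample_connections = {
    ("AL JONES", "1234567"): {"name": "Al Jones", "title": "R&D <Lead>"},
}
xml_text = connections_to_xml("0123456", "MyFirstName MyLastName",
                              "MyLongTitle", sample_connections)
print(xml_text)
```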
11. Once you have the final output, you can either display it on the screen or save it in a UTF-8 encoded file:
# Display to standard output or to UTF-8 file
if len( args ) == 0:
    print output
else:
    # UTF-8 file
    out = file( args[0], "w" )
    out.write( codecs.BOM_UTF8 )
    out.write( output.encode( "utf-8" ) )
    out.close()
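On Python 3, the BOM and the encoding step can be delegated to the "utf-8-sig" codec. A minimal sketch under that assumption; write_output is a hypothetical helper, and the temporary file only serves to show the round trip.

```python
import codecs
import os
import tempfile

def write_output(text, path=None):
    """Print text, or save it to path as UTF-8 with a leading BOM."""
    if path is None:
        print(text)
        return
    # "utf-8-sig" writes the BOM once, then encodes the text as UTF-8
    with open(path, "w", encoding="utf-8-sig", newline="\n") as out:
        out.write(text)

# Round trip through a temporary file
demo_text = "Zo\u00eb Smith (key=1234568)\n"
demo_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_output(demo_text, demo_path)
with open(demo_path, "rb") as saved:
    raw = saved.read()
print(raw.startswith(codecs.BOM_UTF8))
```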

From there you have a simple export of connections. You can improve the script to access the Profile page for each connection and retrieve all information there, such as contact email, current and previous employers, skills, education, etc.
Conclusion
Through the LinkedIn Connections example, we have seen in this article how to access and submit content to HTML pages automatically. This can be very useful in doing automated tests, or automatically provisioning a user to a web application when there is no API available. As it relies on the HTML content, and as this content can change over time, the script may stop working at some point.
This is not really the preferred way to integrate with a web site, but it can be handy for demos, proofs of concept, tests or personal use. Now, let's monitor your LinkedIn connections to see what they are doing!
| Attachment | Size |
|---|---|
| LinkedIn.zip | 158.99 KB |
| LinkedIn_XML_Connections.png | 45.48 KB |
Disclaimer: As with everything else at Cool Solutions, this content is definitely not supported by Novell (so don't even think of calling Support if you try something and it blows up).
It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.
User Comments
Does not seem to work anymore
Submitted by alsteven on 1 November 2011 - 1:00pm.
Looks like a great sample, but it doesn't seem to work anymore... specifically, logging in to LinkedIn is broken. I tried to debug, but I'm stumped: looks like it should work to me.
Can someone help?