Problem: We can register with a website's developer API so that it allows us to pull from its data APIs, but what about impersonating a specific user so that we can pull from a specific user's account, such as at Facebook or Instagram? We need a way for the program to access private data without acquiring and saving the credentials of all those users. Enter OAuth2.
When it comes to a data collection program (the client), REST APIs for private data will require the program to authenticate using OAuth2. A resource owner, such as a Facebook or Twitter user, has data stored on a resource server. The key idea is that the Facebook user should not have to give their login credentials to a client program so that the client can log in. Instead, the client program gets an access token from an authorization server (usually just another server at the target company) with approval (e.g., logging in through a dialog box) from the resource owner (the Facebook user). Given the access token, the client code can access resources from the resource server.
As before, client program developers go to a target company's developer website, such as Twitter's, and ask for an application ID or some other ID that will allow them access to the API. You might also get a so-called secret key from the company. We have already done this for Zillow and Google's YouTube.
From a user's (resource owner) perspective, here is what the authorization dialog looks like when my client program tries to access data from my LinkedIn account:
I log in and then LinkedIn notifies my client program that it has permission to pull data using my account for a short period of time. It gives this permission by passing back an authorization code, which my client program exchanges for an access token (basically yet another ID).
The authentication sequence is tricky to get right, but the basic flow looks like this:
- The client program directs the user's browser to a specific authorization URL at the target company to request access. The client program passes its application ID and usually a secret key. At LinkedIn, the URL is `https://www.linkedin.com/uas/oauth2/authorization`. The browser gives the user an opportunity to log in.
- If successful, the authorization server returns an authorization code by redirecting the browser to a server that is part of our client program. This redirect URL must be registered with the company when we create an application ID. As you will see in my examples, authentication requires that your client pretend to be a web server for one HTTP request, to collect the code generated by the target company server. The code is only valid for a very short period of time, typically measured in seconds.
- The client program then contacts another URL at the authorization server, without user interaction, to exchange the code for an access token. The URL at LinkedIn is `https://www.linkedin.com/uas/oauth2/accessToken`. The token is valid for a short period of time, perhaps minutes or hours.
- The client program can then make REST API calls using the access token, for a limited period of time.
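To make the first step of the flow concrete, here is a minimal sketch in Python 3 of how a client might build the authorization URL that the user's browser visits. The parameter names `response_type`, `client_id`, and `redirect_uri` come from the OAuth 2.0 specification. The application ID `MY_APP_ID` and the localhost redirect URL are placeholders I made up, and a given company may well require additional parameters, so check its developer documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Authorization endpoint at LinkedIn, as given in the flow above
AUTHORIZATION_URL = "https://www.linkedin.com/uas/oauth2/authorization"

def build_authorization_url(client_id, redirect_uri):
    """Return the URL the user's browser should visit to approve access."""
    params = {
        "response_type": "code",       # ask for an authorization code
        "client_id": client_id,        # application ID from the developer site
        "redirect_uri": redirect_uri,  # must match the registered redirect URL
    }
    # urlencode escapes the values so the redirect URL survives as a parameter
    return AUTHORIZATION_URL + "?" + urlencode(params)

# Placeholder values, for illustration only
auth_url = build_authorization_url("MY_APP_ID", "http://localhost:8000")
print(auth_url)
```

Opening `auth_url` in a browser is what would trigger the login dialog; the sketch itself never contacts the network.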
OAuth is very complicated because of all of the potential threats out there. Here is an explanation of why we need the complicated sequence of a user login, followed by an authorization code, followed by an exchange for an access token.
The authorization code is obtained by using an authorization server as an intermediary between the client and resource owner. Instead of requesting authorization directly from the resource owner, the client directs the resource owner to an authorization server (via [browser]), which in turn directs the resource owner back to the client with the authorization code.
Because the resource owner only authenticates with the authorization server, the resource owner's credentials are never shared with the client.
Once we have this authorization code, why do we need to go to yet another server to convert the authorization code to an access token? It is for security reasons. Consider what my URL window looks like after I log in with LinkedIn and it redirects me back to a server running on the localhost (part of my client program that is asking for access to my data):
Whoops! Right there in the window is my authorization code. That is insecure because it is stored in my browser history and perhaps other areas. Fortunately, that code is only active for a few seconds and is then invalidated by the authorization server.
An authorization code represents the intermediate result of a successful end-user authorization process [...]. Authorization codes are sent to the client's redirection URI instead of tokens [because]:
Browser-based flows expose protocol parameters to potential attackers via URI query parameters (HTTP referrer), the browser cache, or log file entries and could be replayed. In order to reduce this threat, short-lived authorization codes are passed instead of tokens and exchanged for tokens over a more secure direct connection between client and authorization server.
The client program can now make a direct https (secure) connection to the authorization server to obtain an access token.
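Here is a hedged sketch in Python 3 of what that direct exchange might look like. It only builds the POST request and does not send it. The form field names (`grant_type`, `code`, `redirect_uri`, `client_id`, `client_secret`) follow the OAuth 2.0 specification; the code, application ID, and secret shown are placeholders, and an individual company's token endpoint may expect something slightly different.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Token endpoint at LinkedIn, as given earlier in these notes
ACCESS_TOKEN_URL = "https://www.linkedin.com/uas/oauth2/accessToken"

def build_token_request(code, client_id, client_secret, redirect_uri):
    """Build (but do not send) the POST that swaps a code for an access token."""
    form = {
        "grant_type": "authorization_code",  # we are redeeming an authorization code
        "code": code,                        # the short-lived code from the redirect
        "redirect_uri": redirect_uri,        # same redirect URL used in the first step
        "client_id": client_id,
        "client_secret": client_secret,      # travels only over this direct HTTPS link
    }
    body = urlencode(form).encode("ascii")
    return Request(ACCESS_TOKEN_URL, data=body, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

# Placeholder values, for illustration only
req = build_token_request("CODE_FROM_REDIRECT", "MY_APP_ID", "MY_SECRET",
                          "http://localhost:8000")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the exchange. Notice that the secret key and the resulting token never pass through the browser, which is the point of the two-step design.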
See the next section for another security concern related to how we get the authorization code back to our client program.
In order to get authentication to work, the authorization server is going to contact us by calling back to a web server URL we specify in the authentication request. This is kind of annoying because it means we have to have a web server running even for a program that is pulling data from the web, not serving data to the web.
It's part of the security protocol that they must call us back rather than just return an access token. This is what I do when someone calls me and claims to be from my bank. I ask for their phone number and then validate it through their main phone exchange or by looking for that person's work page at the bank. The point is that I call them back. I never trust someone contacting me out of the blue.
Here is a simple web server that handles a single request and prints its URL parameters:
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

class MyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Separate the request path from its query string, if any
        p = self.path.split('?')
        if len(p) > 1:
            params = parse_qs(p[1], keep_blank_values=True, strict_parsing=True)
            print(params)
        self.send_response(200)
        self.end_headers()

url = 'http://localhost:8000?user=parrt&foo=bar'
server_address = ('localhost', 8000)
httpd = HTTPServer(server_address, MyHandler)
print(url)
# Block until we get a single request to url
httpd.handle_request() # handle just one request
Exercise: get this server running and click on the link printed out. The server should not respond by printing parameters until you click on the link. Next, replace the print of the URL with a statement that opens the URL in a web browser. The program should now flow automatically without user interaction.
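To connect this one-shot server back to OAuth, here is a sketch in Python 3 of a handler that pulls the `code` parameter out of the redirect. So that it runs with no browser and no real authorization server, a plain HTTP request stands in for the redirect, and `FAKE_CODE` is a made-up value. It also binds to port 0 so the operating system picks a free port; that is an assumption for the demo only, because a real client has to listen on the port in its registered redirect URL.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit, parse_qs

received = {}  # filled in by the handler when the redirect arrives

class CodeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pull the code parameter out of the redirected URL's query string
        query = parse_qs(urlsplit(self.path).query)
        received["code"] = query.get("code", [None])[0]
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"You can close this window now.")

    def log_message(self, format, *args):
        pass  # keep the console quiet

# Port 0 asks the OS for any free port (demo only; see the note above)
httpd = HTTPServer(("127.0.0.1", 0), CodeHandler)
port = httpd.server_address[1]

# Serve exactly one request in the background, like handle_request() above
worker = threading.Thread(target=httpd.handle_request)
worker.start()

# Stand-in for the browser following the authorization server's redirect
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/?code=FAKE_CODE")
conn.getresponse().read()
conn.close()

worker.join()
httpd.server_close()
print(received["code"])
```

In a real run, the background thread and fake request disappear: the program opens the authorization URL in a browser, blocks in `handle_request()`, and continues once the redirect delivers the code.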