This app serves an HTTP API to shorten URLs:
$ curl -H "Content-Type: application/json" -X POST http://localhost:5000/shorten_url -d {
"url": "http://davtyan.org"
}
{
"shortened_url": "http://localhost/i"
}
$ curl -H "Content-Type: application/json" -X GET http://localhost:5000/i
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="HTTP://davtyan.org">HTTP://davtyan.org</a>. If not click the link.
URL shorteners are used to make URLs short and easy to read.
- URLs need to be hard to confuse when hand-written or read out loud, so let's only allow lowercase letters and numbers in URL codes (but not 0 as you can confuse it with O), hence 37 possible characters
- URLs need to be short, so let's use the shortest codes possible. 37 ** x > 30 trillion pages out there, so x = 9 - maximum foreseeable code length. We should start with using shortest codes when possible
- Avoid the situation where URL is reused too soon, as seeing a wrong URL would be perceived as bad service. Let's just delete unused URLs to free up space rather then reuse codes.
The web service is implemented with Flask web framework and uses MongoDB as a storage backend. URLs are inserted into MongoDB and assigned unique incremental integer IDs using a Counters Collection pattern. IDs are mapped to short URL codes and back using a bijective mapping defined in urlcodes.py
, so that long URLs can be quickly retrieved by short codes (storage.py
).
To run the application you will need:
- A MongoDB instance
- Python package dependencies, please refer to
requirements.txt
- For production: Web-server, e.g. nginx
Before running the server, adjust settings.py
to match your Web-server and MongoDB configurations.
Set up MongoDB collections by running:
python storage.py setup
Then you can run the server in test mode:
python server.py
If the service is experiencing high traffic, you may want to horizontally scale MongoDB or/and free up space taken by unused URLs.
The application was designed with hashed sharding in mind. As the URL codes are sequential, hashed sharding will allow to distribute documents/requests evenly among shards. _id
key in URL collection has a hashed index set up, so to shard it you just need to run in MongoDB console:
sh.shardCollection( "<DB>.<URL_COLLECTION>", { "_id" : "hashed" } )
where DB
and URL_COLLECTION
are specified in settings.py
.
To remove URLs that have not been visited for a while, run:
python storage.py cleanup
By default that will remove all URLs in the collection that have not been visited for more than 180 days (you can adjust this interval in settings.py
). You also may want to schedule this command to be ran periodically, e.g. using Cron.
If I had more time to work on this, I would prioritize the following:
- Writing proper tests. I only tested in with
curl
on some trivial cases. - Saving MD5 hashes of URLs so if I have already shortened the given URL I return the old short code instead of creating a new one. I did not do so because searching the entire collection for a document containing a certain string (MD5 hash) would be costly without an index. Most likely this would be as easy as setting up another hashed index on
URL
field but I needed more research to confirm that. - Addressing security concerns: I would research a decent way to avoid XSS injections, etc. I would also add some redundancy into the short URL codes so that they could not be scanned sequentially.
Regarding the scaling note, I omitted all steps I didn't have code for e.g. running multiple servers with a load balancer, etc.