This application was created by Brian Schaefer for The Data Incubator (Fall 2020 cohort).
BikeRank applies the TrueSkill™ Ranking System to amateur road cycling races.
The interactive website at http://bike-rank.herokuapp.com/ allows users to explore the skill rankings for >100,000 cyclists. Users can either view how the ratings are updated for each racer in a specific race, or how the rating for a single racer changes over time.
The website may take 30 seconds to load (both initially and after fetching race/racer results). Please be patient and excuse the delay!
As of October 2021, I have reached the end of my AWS RDS Free Tier Period and am no longer maintaining the Heroku app. I would encourage those interested in exploring this project to clone the repo and run the code locally (see below).
This project combines an assortment of techniques new to the author, each explained briefly below.
Results for over 12,000 bike races are available at URLs like https://results.bikereg.com/race/11456, where the race ID number ranges from 1 to 12649 (as of November 2020). I use `requests_futures` to asynchronously obtain the text of these webpages (`scraping.get_futures`) and use regular expressions to extract the name, date, and location for each race (`scraping.scrape_race_page`).
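A minimal sketch of this fetch-then-extract pattern, using only the standard library so it runs without network access. `requests_futures` hands requests to a thread pool in much the same way. The helper names `race_urls` and `fetch_all`, the stand-in `fake_fetch`, and the `<title>` markup are illustrative assumptions, not the contents of `scraping.py` or the real BikeReg page format.

```python
import re
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://results.bikereg.com/race/{}"


def race_urls(first_id, last_id):
    """Build one results-page URL per race ID in the inclusive range."""
    return [BASE_URL.format(race_id) for race_id in range(first_id, last_id + 1)]


def fetch_all(urls, fetch, max_workers=8):
    """Submit every fetch to a thread pool, then collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, url) for url in urls]
        return [future.result() for future in futures]


def fake_fetch(url):
    """Stand-in for an HTTP GET; returns made-up markup for the given URL."""
    race_id = url.rsplit("/", 1)[-1]
    return "<html><title>Race {}</title></html>".format(race_id)


# Hypothetical pattern: the real pages may mark up the race name differently.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

pages = fetch_all(race_urls(1, 3), fake_fetch)
names = [TITLE_RE.search(page).group(1) for page in pages]
print(names)
```

Passing the fetch function in as an argument keeps the sketch testable offline; a real run would swap `fake_fetch` for an actual HTTP call.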
Hidden within each of these pages is a link to a JSON file containing the results for that race. I again use `requests_futures` to download the contents of these JSON files (`scraping.get_results_futures`) and convert them into Python dictionaries (`scraping.scrape_results_json`).
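As a rough illustration of the JSON-to-dictionary step, the sketch below decodes a small payload with the standard `json` module. The payload and its field names (`Name`, `Place`, `Category`) are made up for the example, and `parse_results` is a hypothetical helper rather than `scraping.scrape_results_json` itself.

```python
import json

# Made-up payload; the real BikeReg results JSON may use different fields.
raw = (
    '[{"RacerID": 7, "Name": "A. Rider", "Place": 2, "Category": "Cat 4"},'
    ' {"RacerID": 9, "Name": "B. Rider", "Place": 1, "Category": "Cat 4"}]'
)


def parse_results(text):
    """Decode a JSON array of result rows into dicts sorted by finishing place."""
    rows = json.loads(text)
    return sorted(rows, key=lambda row: row["Place"])


results = parse_results(raw)
print([row["Name"] for row in results])
```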
The relevant data for this project are stored in a PostgreSQL database hosted on AWS with three tables:
- `Races`: Each row corresponds to one race event identified by a unique `race_id`. This table stores the relevant metadata for each race, including its name, date, location, list of race categories, and the number of racers competing in each category.
- `Racers`: Each row corresponds to one racer identified by a unique `RacerID`. This table primarily stores the racer's name and current skill rating (parameterized by a mean skill rating `mu` and uncertainty `sigma`). The ratings in this table are updated upon processing each additional race.
- `Results`: Each row corresponds to one result: the finishing place for one racer in one category of one race, along with the corresponding ID numbers for each. This table also records the skill rating of each racer both prior to (`prior_mu`, `prior_sigma`) and as a result of (`mu`, `sigma`) the race outcome.
In `model.py`, I represent the tables using SQLAlchemy classes. There are a variety of helper functions defined here to query and update the database.
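To make the relationship between the three tables concrete, here is a small sketch that uses the standard-library `sqlite3` module in place of PostgreSQL and SQLAlchemy. Only `race_id`, `RacerID`, `mu`, `sigma`, `prior_mu`, and `prior_sigma` come from the description above; the remaining column names, the sample rows, and the `rating_history` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Races (race_id INTEGER PRIMARY KEY, name TEXT, date TEXT);
CREATE TABLE Racers (RacerID INTEGER PRIMARY KEY, name TEXT, mu REAL, sigma REAL);
CREATE TABLE Results (
    ResultID INTEGER PRIMARY KEY,
    race_id INTEGER REFERENCES Races (race_id),
    RacerID INTEGER REFERENCES Racers (RacerID),
    place INTEGER,
    prior_mu REAL, prior_sigma REAL, mu REAL, sigma REAL
);
-- Made-up sample data: one racer who has finished two races.
INSERT INTO Races VALUES (1, 'Spring Crit', '2019-04-01'), (2, 'Summer Crit', '2019-07-01');
INSERT INTO Racers VALUES (7, 'A. Rider', 26.4, 6.5);
INSERT INTO Results VALUES
    (1, 1, 7, 3, 25.0, 8.3, 27.1, 7.2),
    (2, 2, 7, 9, 27.1, 7.2, 26.4, 6.5);
""")


def rating_history(connection, racer_id):
    """Return (date, prior_mu, mu) for each of a racer's results, oldest first."""
    cursor = connection.execute(
        """
        SELECT Races.date, Results.prior_mu, Results.mu
        FROM Results JOIN Races ON Races.race_id = Results.race_id
        WHERE Results.RacerID = ?
        ORDER BY Races.date
        """,
        (racer_id,),
    )
    return cursor.fetchall()


history = rating_history(conn, 7)
print(history)
```

Because each `Results` row stores both the prior and the updated rating, a racer's rating over time can be read from that table with a single join on `Races` for the dates.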
I have adapted the Python implementation of TrueSkill to determine skill ratings for the racers represented in the dataset. This algorithm compares the skill ratings of racers involved in each race and evaluates the final results considering its prior knowledge of each racer's relative skill. For more information about how the algorithm works, please see this article.
While the mathematics behind TrueSkill are relatively complex, updating ratings is straightforward: TrueSkill receives a list of the skill ratings as input and returns a list of updated ratings.
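To give a feel for the "ratings in, ratings out" shape of the update, the sketch below implements the well-known closed-form TrueSkill update for the simplest case: two racers, no draws, and no dynamics factor. It is not the code this project runs; the `trueskill` package handles races with many finishers and extra parameters. `rate_two` is a hypothetical helper, and the defaults (`mu = 25`, `sigma = 25/3`, `beta = 25/6`) are the standard TrueSkill starting values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def rate_two(winner, loser, beta=25 / 6):
    """Update two (mu, sigma) ratings after `winner` beats `loser` (no draws)."""
    (mu_w, sigma_w), (mu_l, sigma_l) = winner, loser
    # Total spread of the performance difference between the two racers.
    c = sqrt(2 * beta**2 + sigma_w**2 + sigma_l**2)
    t = (mu_w - mu_l) / c
    v = STD_NORMAL.pdf(t) / STD_NORMAL.cdf(t)  # drives the shift in the means
    w = v * (v + t)                            # drives the shrink in uncertainty
    new_winner = (mu_w + sigma_w**2 / c * v,
                  sigma_w * sqrt(1 - sigma_w**2 / c**2 * w))
    new_loser = (mu_l - sigma_l**2 / c * v,
                 sigma_l * sqrt(1 - sigma_l**2 / c**2 * w))
    return [new_winner, new_loser]


default = (25.0, 25 / 3)
updated = rate_two(default, default)
print(updated)
```

The winner's mean rises, the loser's falls by the same amount when both start from equal uncertainty, and both uncertainties shrink because the result carries information about each racer.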
`results.get_all_ratings` iterates through each category of each race in chronological order and applies TrueSkill (`results.run_trueskill`) to all placing racers. `results.get_predicted_places` predicts the finishing place for a group of racers by ordering their ratings: the racer with the highest rating is predicted to finish in 1st place, and so on.
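The prediction step can be sketched in a few lines. This version assumes that "highest rating" means the highest mean `mu`; the actual `results.get_predicted_places` may order racers by a different quantity, and the `predicted_places` helper and sample ratings below are hypothetical.

```python
def predicted_places(ratings):
    """Map each racer ID to a predicted place, given {racer_id: (mu, sigma)}."""
    # Assumption: racers are ordered by mean skill, highest first.
    ordered = sorted(ratings, key=lambda racer_id: ratings[racer_id][0], reverse=True)
    return {racer_id: place for place, racer_id in enumerate(ordered, start=1)}


ratings = {101: (27.5, 4.0), 102: (31.2, 6.1), 103: (22.9, 3.3)}
places = predicted_places(ratings)
print(places)
```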
The website is a Flask application deployed on Heroku with a single user-facing webpage.
For troubleshooting, I set up the `/database` URL to display the first 2000 rows of each table in the database. The parameters `table` and `start` can be used to specify which table to query and from what index to start showing results (e.g. `?table=Races&start=23`). If the `table` parameter is not specified, the page displays the `Results` table, and if the `start` parameter is not specified, the rows start from index 0.
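The defaults described above can be sketched as a small parsing function. It uses the standard library's `urllib.parse` as a stand-in for Flask's request arguments, and it assumes the 2000-row window begins at `start`; `parse_database_query` and `PAGE_SIZE` are hypothetical names, not part of the app.

```python
from urllib.parse import parse_qs

PAGE_SIZE = 2000  # number of rows shown per page, per the description above


def parse_database_query(query_string):
    """Return (table, first_row, last_row_exclusive) for a /database query string."""
    params = parse_qs(query_string)
    table = params.get("table", ["Results"])[0]  # default table is Results
    start = int(params.get("start", ["0"])[0])   # default start index is 0
    return table, start, start + PAGE_SIZE


print(parse_database_query("table=Races&start=23"))
print(parse_database_query(""))
```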
The Heroku app uses a production configuration (see `config.py`) which prevents users from altering the database. In a development configuration, the following parameters can be used to alter the database using the `/database` URL:
- `drop`: either `True` or comma-separated table names (e.g. `Races,Results`). Will drop listed tables (all tables if `True`) and re-create empty tables with the appropriate schema, using the functions `commands.db_drop_all` and `commands.db_create_all`.
- `add`: either `True` or comma-separated table names (e.g. `Races,Results`). Will attempt to add rows to the listed tables (all tables if `True`) by scraping each BikeReg race page and/or results JSON. This parameter calls the `add_table` method for each table.
- `subset`: two comma-separated integers (e.g. `subset=1,1000`) indicating the range of `race_id`s to add to the `Races` table. If not specified, the range of `race_id`s will be `1,13000`.
- `rate`: if `True`, will apply TrueSkill to all results in the database, regardless of whether the results have been rated already or not.
- `limit`: integer specifying the number of `Results` rows to rate, for debugging purposes.
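A minimal sketch of how the `drop`/`add` and `subset` values described above could be interpreted. The helpers `parse_tables` and `parse_subset` are hypothetical and are not taken from the repository; the table order follows the order in which the tables are populated, and the default range mirrors the `1,13000` default stated above.

```python
ALL_TABLES = ["Races", "Results", "Racers"]


def parse_tables(value):
    """Turn a drop/add value into a list of table names ([] if not given)."""
    if value is None:
        return []
    if value == "True":
        return list(ALL_TABLES)  # True selects every table
    return [name.strip() for name in value.split(",") if name.strip()]


def parse_subset(value, default=(1, 13000)):
    """Turn 'first,last' into an integer pair, or use the default range."""
    if value is None:
        return default
    first, last = (int(part) for part in value.split(","))
    return first, last


print(parse_tables("Races,Results"))
print(parse_subset("1,1000"))
```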
Follow these steps to get the website running on your local machine:
- `git clone` the repository.
- `pip install -r requirements.txt`
- Install PostgreSQL and create a database.
- Create a `.env` file in the root directory of the project with the following contents:
APP_SETTINGS=config.DevelopmentConfig
DATABASE_URL=postgres://<username>:<password>@<host>:<port>/<db_name>
SECRET_KEY=<secret_key_here>
- Execute `flask db-create-all` to create all tables in the database (execute `flask db-drop-all` first if tables already exist in the database).
- Run the Flask app with `flask run`.
- Navigate to `localhost:5000/database`. At this point, the `Results` table is empty, so you should only see the column names.
- Navigate to `localhost:5000/database?add=True&subset=1,1000` to add data to (in order) the `Races`, `Results`, and `Racers` tables. As explained above, the `subset` parameter (optional) can be used to limit the number of races considered and should be omitted to add the entire dataset.
- Finally, navigate to `localhost:5000` to view the user-facing interface and explore the results!
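One thing that may trip up a local setup: SQLAlchemy 1.4 and later no longer accept the `postgres://` scheme and expect `postgresql://` instead. If the pinned requirements install an older version, this does not matter. Otherwise, a small helper along these lines could normalize the value read from `.env`. `normalize_database_url` is a hypothetical function, not part of the repository, and the credentials in the example are made up.

```python
def normalize_database_url(url):
    """Rewrite a leading postgres:// scheme to postgresql://; leave other URLs alone."""
    prefix = "postgres://"
    if url.startswith(prefix):
        return "postgresql://" + url[len(prefix):]
    return url


# Made-up connection string for illustration only.
example = "postgres://user:secret@localhost:5432/bikerank"
print(normalize_database_url(example))
```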