WIP project to visualize Strava data using open-source tooling
🏃🚴 🤝 📊
🧰 Tools planned:
Install the required dependencies:

```shell
uv sync
```
uv is the preferred package manager for this project; if you are unfamiliar with uv, here are the equivalents of the common commands:

- `uv sync` = `poetry install`
- `uv run my_script.py` = `poetry run my_script.py` = `source .venv/bin/activate && python my_script.py`
- `uv run dlt --help` = `source .venv/bin/activate && dlt --help`
Create your own Strava API Application
This is necessary to generate the credentials (e.g. `client_id`, `client_secret`, etc.) you will supply to the strava resource in `strava.py`. Don't worry, creating an application is just a few clicks.
(full details on the process of creating an app are supplied by Strava here)
Once your app has been created, it will come with a refresh token and an access token. For the purposes of this setup (i.e. creating an automated `dlt` pipeline to export your Strava data), these are utterly useless:

- The initial Access Token expires after 6 hours and will need updating (this is how Strava manages access tokens)
- The initial Refresh Token is minimally scoped with `read` access, which does not allow viewing of activity data (there is a separate `activity:read` scope that must be granted to a token to view activity data)
To get around these issues, we will create a new refresh token with the proper scopes that we can pass to `dlt`. `dlt` can then use this already-authorized refresh token to fetch an up-to-date access token whenever it runs.
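Under the hood, that refresh-token exchange is a single POST to Strava's token endpoint. A minimal sketch of it in stdlib-only Python (the function names here are illustrative, not part of `dlt` or this repo):

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.strava.com/oauth/token"

def build_refresh_payload(client_id: str, client_secret: str, refresh_token: str) -> bytes:
    """Form-encode the fields Strava expects for a refresh_token grant."""
    return urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    }).encode()

def fetch_access_token(client_id: str, client_secret: str, refresh_token: str) -> str:
    """POST the refresh token and return a fresh, short-lived access token."""
    req = urllib.request.Request(
        TOKEN_URL,
        data=build_refresh_payload(client_id, client_secret, refresh_token),
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

Because the refresh token carries the scopes, every access token minted this way inherits them, which is why the scoping step below matters.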
(these steps were ripped from the Strava developer docs)
- Paste the Client ID from your app into this URL where `[REPLACE_WITH_YOUR_CLIENT_ID]` is, and specify the required scopes where `[INSERT_SCOPES]` is:

  ```
  https://www.strava.com/oauth/authorize?client_id=[REPLACE_WITH_YOUR_CLIENT_ID]&response_type=code&redirect_uri=http://localhost/exchange_token&approval_prompt=force&scope=[INSERT_SCOPES]
  ```

  Note: in order to get full read access to both your public and private data, these are the necessary scopes: `read_all,activity:read_all,profile:read_all`
- Paste the updated URL into a browser, hit enter, and follow the authorization prompts.
- After authorization, you should get a "This site can't be reached" error. Copy the authorization code (i.e. the `code=` value) from the returned URL.
- Run a cURL request to generate the refresh token that `dlt` will use:

  ```shell
  curl -X POST https://www.strava.com/oauth/token \
    -F client_id=YOURCLIENTID \
    -F client_secret=YOURCLIENTSECRET \
    -F code=AUTHORIZATIONCODE \
    -F grant_type=authorization_code
  ```
- If successful, the response will return a JSON payload with a `refresh_token` (save this)
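The steps above can also be sketched in Python, in case you prefer that to cURL (a stdlib-only sketch; the function names are my own, not part of this repo):

```python
import json
import urllib.parse
import urllib.request

SCOPES = "read_all,activity:read_all,profile:read_all"

def authorize_url(client_id: str, scopes: str = SCOPES) -> str:
    """Build the one-time authorization URL to open in a browser (step 1)."""
    params = urllib.parse.urlencode({
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": "http://localhost/exchange_token",
        "approval_prompt": "force",
        "scope": scopes,
    })
    return f"https://www.strava.com/oauth/authorize?{params}"

def exchange_code(client_id: str, client_secret: str, code: str) -> str:
    """Trade the code= value from the redirect URL for a refresh token (step 4)."""
    data = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,
        "grant_type": "authorization_code",
    }).encode()
    req = urllib.request.Request("https://www.strava.com/oauth/token", data=data)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["refresh_token"]
```

Open the URL returned by `authorize_url(...)` in a browser, then pass the copied `code=` value to `exchange_code(...)` to get the refresh token.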
`dlt` offers various methods for storing connection credentials; shown below is an example `secrets.toml` for the Strava connection secrets:
```toml
[sources.strava.credentials]
access_token_url = "https://www.strava.com/oauth/token"
client_id = "<your_client_id>"
client_secret = "<your_client_secret>"
refresh_token = "<your_refresh_token>"
```
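If you would rather not keep secrets in a file, dlt can also read credentials from environment variables, using upper-cased, double-underscore-separated versions of the same TOML paths (the variable names below mirror the keys above; verify them against the dlt configuration docs for your version):

```shell
export SOURCES__STRAVA__CREDENTIALS__ACCESS_TOKEN_URL="https://www.strava.com/oauth/token"
export SOURCES__STRAVA__CREDENTIALS__CLIENT_ID="<your_client_id>"
export SOURCES__STRAVA__CREDENTIALS__CLIENT_SECRET="<your_client_secret>"
export SOURCES__STRAVA__CREDENTIALS__REFRESH_TOKEN="<your_refresh_token>"
```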
With credentials defined, the strava datastack dlt pipeline can be run via:

```shell
uv run strava.py
```
By default, this will just load the last 30 days of data.
This is done because Strava limits read requests to 1000/day. Depending on how many years of Strava data you have and your total number of activities, you may need to break up your initial historical load across multiple days. It's recommended to start with a few months at first to get a sense of how many requests that is. Fair warning: if your timespan is too large and you hit the daily request limit, the pipeline will error, and you will need to retry loading all of that data (with a smaller time window) the following day.
To complete a larger historical load, or run a backfill, you can pass in a `--start-date` and `--end-date`.

Initial load with data starting from January 2024*:

```shell
uv run strava.py --start-date='2024-01-01'
```

Backfill data starting from January 2024 up until July 2024:

```shell
uv run strava.py --start-date='2024-01-01' --end-date='2024-07-01'
```
*Note: if you have already run the pipeline once, you will have a `last_value` saved in your state, and you will need to provide an `--end-date` to temporarily override it (if you don't, your pipeline will think it is already up to date and will not properly backfill).
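Since a large backfill can blow past the daily request limit, one way to break it up is to run the pipeline one month at a time. A sketch of generating those windows (the chunking helper is my own, not part of this repo):

```python
from datetime import date

def month_windows(start: date, end: date):
    """Yield (start, end) date pairs covering [start, end) one calendar month at a time."""
    cur = start
    while cur < end:
        # First day of the month after cur (rolling over the year in December)
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        yield cur, min(nxt, end)
        cur = nxt

# Print one pipeline invocation per month-sized window
for s, e in month_windows(date(2024, 1, 1), date(2024, 7, 1)):
    print(f"uv run strava.py --start-date='{s}' --end-date='{e}'")
```

Passing an explicit `--end-date` on every invocation also sidesteps the `last_value` override issue described in the note above.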
You can check this stored `last_value` at any time by making use of the `-v` flag:

```shell
uv run dlt pipeline -v strava_datastack info
```
An example of the relevant output as it relates to `last_value`:
```
sources:
{
  "strava": {
    "resources": {
      "activities": {
        "incremental": {
          "start_date": {
            "initial_value": "2024-11-16T00:00:00Z",
            "last_value": "2024-12-15T20:48:45Z",
            "unique_hashes": [
              "k3Za3tV8zvNDwCyexCJQ"
            ]
          }
        }
      }
    }
  }
}
Local state:
first_run: False
_last_extracted_at: 2024-12-16 05:00:59.894989+00:00
_last_extracted_hash: QeJ5iEmeeGmB3zs5XxaM1oRvcARLpgsmifRxmxj84Og=
```