Your CloudWatch for Heroku. Monitor dyno errors and performance across all your Heroku apps and take automated actions!
This simple Python/Django application aims to make the post-deployment lifecycle easy and automated for end users. Here are the main use cases that motivated us to build this app:
- Elusive Dyno Level Errors
Heroku dyno-level errors (R13, R14, H12, H10) are not easy to capture because they show up only in application logs and the metrics dashboard, yet they heavily impact app performance
- Memory Leaking Applications
You may have an app that is leaking memory, and you do not have the bandwidth or the need to identify the source of the leak. A quicker fix can be to restart the dyno when it exceeds its RAM quota
- Watchdog
A watchdog app constantly checks the status of your critical web services or databases so that you are alerted in an emergency. An ideal watchdog also takes automated corrective actions
- Dyno Metrics
Heroku outputs dyno metrics like %CPU and memory when you enable them, but this data is not accessible in a queryable format for later analysis
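To make the first use case concrete, here is a minimal sketch of pulling error codes out of log lines. The sample lines are illustrative and modeled on Heroku's general log format; the regular expressions and the `extract_error` helper are assumptions for this example, not this project's actual implementation.

```python
import re

# Platform errors reported by the dyno manager, e.g. "heroku[web.1]: Error R14 (...)".
DYNO_ERROR_RE = re.compile(r"heroku\[(?P<dyno>[\w.]+)\]: Error (?P<code>R\d{2})")
# Router errors, e.g. "heroku[router]: at=error code=H12 ... dyno=web.2 ...".
ROUTER_ERROR_RE = re.compile(r"at=error code=(?P<code>H\d{2}).*?dyno=(?P<dyno>[\w.]*)")


def extract_error(line):
    """Return (code, dyno) if the log line reports an Rxx/Hxx error, else None."""
    for pattern in (DYNO_ERROR_RE, ROUTER_ERROR_RE):
        match = pattern.search(line)
        if match:
            return match.group("code"), match.group("dyno")
    return None


# Illustrative log lines (hypothetical values).
SAMPLE_LINES = [
    "2020-05-01T10:00:00+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)",
    '2020-05-01T10:00:01+00:00 heroku[router]: at=error code=H12 desc="Request timeout" '
    'method=GET path="/" host=example.herokuapp.com dyno=web.2 status=503',
    "2020-05-01T10:00:02+00:00 app[web.1]: GET / 200",
]

for sample in SAMPLE_LINES:
    print(extract_error(sample))
```

Ordinary application lines return `None`, so a monitor built this way only reacts to lines that carry an error code.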
In its final form, the monitoring suite will contain the following:
- ✔️ Monitor and catch Generic Dyno Errors (R13, R14 memory errors and other Rxx errors). You can configure the Rxx Error Rules from the admin interface
- ✔️ Monitor and catch Web Specific Dyno Errors (H12, H13, H18 and other Hxx errors). Configuration via the admin interface
- ⏳ Monitor Web Dyno Failed Requests (5xx errors)
- ✔️ Email alerts. Set up email by entering your SMTP server details and recipients, then configure your alert rules from the admin dashboard
- ⏳ SMS alerts. Configure SMS alerts by entering your Infobip SMS provider details. If you use another service (Twilio, MSG91, etc.), you can easily plug in your own implementation
- ✔️ Restart actions. Perform basic recovery actions like dyno restart or app restart when an alert rule is breached
- 🔜 Web server up / down status checks. Specify the endpoint to check, frequency and timeout
- 🔜 Redis instance availability status checks. Specify the Redis URL, list name and threshold (for list / queue length monitoring if required), frequency and timeout
- 🔜 Postgres instance availability status checks. Specify the database URL, table name (to check for existence if required), frequency and timeout
- 🔜 RAM usage metric per dyno type. Collection and logging
- 🔜 Load %CPU per dyno type. Collection and logging
Metrics Collection only applies to dynos that have metrics logging enabled
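As a rough illustration of what metrics collection could involve, the sketch below parses `sample#name=value` pairs of the kind Heroku writes to the log stream when runtime metrics logging is enabled. The sample line, the regular expression and the `parse_metrics` helper are assumptions made for this example and are not taken from this project's code.

```python
import re

# One "sample#<name>=<number><optional unit>" pair, e.g. sample#memory_total=21.00MB
SAMPLE_RE = re.compile(r"sample#(?P<name>\w+)=(?P<value>[\d.]+)(?P<unit>[A-Za-z]*)")


def parse_metrics(line):
    """Return {metric_name: (value, unit)} for every sample# pair in the line."""
    return {
        m.group("name"): (float(m.group("value")), m.group("unit"))
        for m in SAMPLE_RE.finditer(line)
    }


# Illustrative metrics line (hypothetical values).
METRICS_LINE = (
    "source=web.1 dyno=heroku.123.abc sample#memory_total=21.00MB "
    "sample#memory_rss=20.50MB sample#load_avg_1m=0.25"
)

print(parse_metrics(METRICS_LINE))
```

Storing the resulting dictionary per dyno and timestamp is one way to make the data queryable later.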
To quickly get started and test the app, you can deploy this application to your Heroku account. Fill in the prerequisite environment variables and your app should be up within 5 minutes
After deployment, the app
- auto-detects all existing Heroku apps and dynos in your account
- makes the admin interface available at https://<your-app-name>.herokuapp.com/admin/
- Required environment variables
  - HEROKU_API_KEY: to access the logs from different Heroku apps in your account
  - DJANGO_SUPERUSER_USERNAME, DJANGO_SUPERUSER_PASSWORD, DJANGO_SUPERUSER_EMAIL: superuser credentials to access the admin interface
- Configure emails (optional)
  - ENABLE_EMAILS: 0: disable (default), 1: enable
  - EMAIL_HOST: example email-server.abc.com
  - EMAIL_PORT: example 587
  - EMAIL_HOST_PASSWORD: SMTP password
  - EMAIL_HOST_USER: SMTP username
  - SERVER_EMAIL: the sender's email
  - RECIPIENTS: list of recipient email addresses in comma-separated format (no spaces). Example: admin.one@abc.com,admin.two@abc.com
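The email variables above could be read along the following lines. This is only a sketch: the `email_settings` helper is hypothetical, the fallback port of 587 is an assumption borrowed from the example value, and the project's real settings module may differ.

```python
import os


def email_settings(env):
    """Build an email configuration dict from a mapping of environment variables."""
    return {
        # ENABLE_EMAILS: "0" disables (default), "1" enables.
        "enabled": env.get("ENABLE_EMAILS", "0") == "1",
        "host": env.get("EMAIL_HOST"),
        # Assumption: fall back to 587 when EMAIL_PORT is not set.
        "port": int(env.get("EMAIL_PORT", "587")),
        "user": env.get("EMAIL_HOST_USER"),
        "password": env.get("EMAIL_HOST_PASSWORD"),
        "sender": env.get("SERVER_EMAIL"),
        # RECIPIENTS is comma-separated with no spaces; skip empty entries.
        "recipients": [r for r in env.get("RECIPIENTS", "").split(",") if r],
    }


# Usage with the real process environment:
config = email_settings(os.environ)
print(config["enabled"], config["recipients"])
```

Passing the environment in as a mapping keeps the helper easy to test with a plain dictionary.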
- Log in to the admin panel. It will be located at https://<your-app-name>.herokuapp.com/admin/
- You should see apps and dynos auto-detected from your Heroku account
- You can now start adding rules!
- To detect and act on an R14 error on a certain dyno, add an Rxx Error rule
  - Pick the dyno to which the rule should apply
  - Pick the error category, in our case R14
  - Enter the least count, which is the minimum number of occurrences required within the time window (in seconds) for the rule to be considered breached
  - Check email alert if you want email alerts (requires mail to be configured)
  - Pick the action to be taken when the alert condition is breached. Possible options are no-action, restart-dyno and restart-app
  - Save the rule and you are done!
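The interplay of least count and time window can be sketched as a sliding-window counter. The `WindowRule` class below is a hypothetical illustration of that idea under the assumption that timestamps are in seconds; it is not the project's actual rule engine.

```python
from collections import deque


class WindowRule:
    """Breached when at least `least_count` errors occur within `time_window` seconds."""

    def __init__(self, least_count, time_window):
        self.least_count = least_count
        self.time_window = time_window
        self._events = deque()  # timestamps of recent occurrences, oldest first

    def record(self, timestamp):
        """Record one error occurrence and return True if the rule is now breached."""
        self._events.append(timestamp)
        # Discard occurrences that have fallen out of the time window.
        while self._events and timestamp - self._events[0] > self.time_window:
            self._events.popleft()
        return len(self._events) >= self.least_count


# Three R14 errors within 60 seconds breach the rule; an isolated one later does not.
rule = WindowRule(least_count=3, time_window=60)
print([rule.record(t) for t in (0, 10, 50, 200)])
```

A higher least count or a shorter window makes the rule less sensitive to occasional, isolated errors.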
For email sending per topic (app:dyno:error category), a cooling period applies, which can be configured via the EMAIL_COOLING_PERIOD_PER_TOPIC environment variable
Cooling periods also apply to the app restart and dyno restart actions. They can be configured via APP_RESTART_COOLING_PERIOD and DYNO_RESTART_COOLING_PERIOD respectively
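A cooling period amounts to suppressing repeats of the same alert or action until enough time has passed. The sketch below shows one way to express this per topic; the `CoolingPeriod` class is hypothetical, and treating the configured value as seconds is an assumption, since the unit is not stated here.

```python
class CoolingPeriod:
    """Allow an alert or action for a topic at most once per cooling period."""

    def __init__(self, seconds):
        self.seconds = seconds  # assumption: the period is expressed in seconds
        self._last_fired = {}   # topic -> timestamp of the last allowed event

    def allow(self, topic, now):
        """Return True and remember `now` if the topic is outside its cooling period."""
        last = self._last_fired.get(topic)
        if last is not None and now - last < self.seconds:
            return False
        self._last_fired[topic] = now
        return True


# Topics follow the app:dyno:error category shape described above.
emails = CoolingPeriod(seconds=300)
print(emails.allow("myapp:web.1:R14", now=0))    # first alert goes out
print(emails.allow("myapp:web.1:R14", now=100))  # suppressed, still cooling
print(emails.allow("myapp:web.2:R14", now=100))  # different topic, goes out
```

Keeping a separate instance for emails, dyno restarts and app restarts would mirror the three separate environment variables.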
This is a Django-based application. You need the following:
- Python 3: we have verified it on Python 3.6
- virtualenv: or a similar program to manage your virtual environment
- postgres: small instance, less than 1 GB RAM required
- redis: small instance, less than 25 MB storage
- Create a virtual environment and install all dependencies
  pip install -r requirements.txt
- Add environment variables
  - Required environment variables
    - DATABASE_URL containing the URL of your postgres database
    - REDIS_URL containing the URL of your redis instance
- Ensure you are able to run the server using
  # Dev server
  python manage.py runserver
  # Deployment server
  gunicorn dynomonitor.wsgi
- Run the initialization script as follows:
  bash initialize.sh
  It will
  - create tables in postgres,
  - set up the superuser account and
  - auto-detect your Heroku apps and dynos
- (Optional) You can create another superuser using:
  python manage.py createsuperuser
- (Optional) You can rerun auto detection of Heroku apps too:
  python manage.py shell --command="from utils.rule_helper import auto_detect_heroku_apps; auto_detect_heroku_apps()"
- Finally, run the server again, and access the admin interface at http://127.0.0.1:8000/admin from your browser
If you want to add more features or notice a bug, feel free to report them as issues and create PRs. Your contributions are welcome!
- I am not able to deploy to Heroku / run the app locally. What should I do?
  - Please raise an issue and share the error logs you are getting. We will try our best to help :)
- Which logs are accessed to extract the errors & metrics info?
  - Please refer to this wiki on the log access method and log samples
The following members have actively contributed to the source code and this repository:
- MIT license
- Copyright (c) 2020 Carnot Technologies Pvt Ltd