Simple CloudFormation templates to assist in creating EC2 instances for DeepRacer training, with automated training start/end and up to 10X savings over training in the console (when using EC2 spot instances). This is a wrapper around LarsLL's deepracer-for-cloud https://aws-deepracer-community.github.io/deepracer-for-cloud/ that makes it very easy to start training in the AWS Console and take advantage of all the amazing tools the deepracer-for-cloud repo gives you.
Training on an EC2 instance has many advantages:
Check the latest spot pricing for suitable GPU instances via this automated price checker
The diagram below provides an overview of the architecture of DeepRacer on the Spot:
Training videos playlist: https://www.youtube.com/playlist?list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Overview: https://www.youtube.com/watch?v=GP7IZ6X5QPU&list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Setup and first run: https://www.youtube.com/watch?v=b4GHWZcIB18&list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Edit files: https://www.youtube.com/watch?v=EAFR7FSN4Bo&list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Update track: https://www.youtube.com/watch?v=XgdRSAeAzHk&list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Increment training: https://www.youtube.com/watch?v=9y5wx7fQUgc&list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Move model to console: https://www.youtube.com/watch?v=Fk0XCoE8M6U&list=PL9qmHoKq77dTFS59WjHciNb0a0n0dE8iF
- Log into AWS console and launch Cloud Shell
- run
git clone https://github.com/aws-deepracer-community/deepracer-on-the-spot
- run
cd deepracer-on-the-spot
INPUTS:
- stackName - name of base resource stack (example 'base')
- ip - the IP of the machine you are using. This is needed to allowlist your machine's IPv4 to view the agent training and access our menu resources (can be found here https://www.whatismyip.com/)
Example:
./create-base-resources.sh base 11.111.11.11
This will run for around 3 minutes.
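
Because the ip input has to be a well-formed IPv4 address, it can be worth checking the value before passing it to the script. The following is a minimal sketch in POSIX shell; the function name is_valid_ipv4 is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of deepracer-on-the-spot): returns 0 when
# the argument looks like a dotted-quad IPv4 address, 1 otherwise.
is_valid_ipv4() {
  ip=$1
  # Reject empty input, non digit/dot characters, and misplaced dots.
  case $ip in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  # Split on dots; exactly four fields are required.
  old_ifs=$IFS
  IFS=.
  set -- $ip
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    # Each field must be 1-3 digits with a value of 0-255.
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  return 0
}
```

A possible use is `is_valid_ipv4 "$MY_IP" && ./create-base-resources.sh base "$MY_IP"`, where MY_IP is a variable you set yourself. A value with five dot-separated fields, or a field above 255, is rejected.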
The primary purpose of this template is to provide a simple single script that sets up all of the prerequisite AWS resources to allow deepracer-for-cloud to run on EC2 instances (https://aws-deepracer-community.github.io/deepracer-for-cloud/). This should only be run once per sandbox per region. This is accomplished by creating the following:
- S3 bucket
- SNS Topic that has messages published to it in the event of spot instance termination to stop training safely and upload model
- EC2 quota limit increases to be able to run 2 x g4dn.2xlarge spot or on demand instances (See FAQs if AWS query the rationale for the quota request)
- Role used to import finished model into the AWS DeepRacer console
This bash script utilizes the base.resources.yaml template file to provision the above resources. Note: if your public IP later changes (e.g. you reboot your router and your ISP changes your IP address) you can re-run this script with the same stackName and the updated ip parameter, and the stack will just modify the appropriate config.
INPUTS:
- baseResourcesStackName - the stackName you passed to create-base-resources.sh. If you ever forget this, you can go to CloudFormation and look at your existing stacks
- stackName - name of this stack, which will provision an EC2 instance and automatically start DeepRacer training. This is also the name used for your model when it's imported into the AWS DeepRacer console, and therefore must conform to that naming convention (Up to 64 characters. Valid characters: A-Z, a-z, 0-9, and hyphens (-). No spaces or underscores (_).)
- timeToLiveInMinutes - how long you want this EC2 instance to run for. After X minutes, the instance will be terminated. Default: 60, Min: 0, Max: 1440. If you want the instance to stay alive forever, set this value to 0 (caution: you will be charged per hour the instance is running, and you will need to stop/terminate the instance on your own). You can also increase the max time in standard-instance.yaml or spot-instance.yaml if you wish to have a model train for more than 24 hours.
Example:
./create-standard-instance.sh base firstmodelbase 30
./create-spot-instance.sh base firstmodelspot 30
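
The constraints on stackName and timeToLiveInMinutes described above can be checked locally before a stack is created. This is a minimal sketch in POSIX shell that encodes only the rules stated in this README; the function names is_valid_stack_name and is_valid_ttl are hypothetical, and CloudFormation or the DeepRacer console may enforce further rules not covered here.

```shell
# Hypothetical helpers (not part of deepracer-on-the-spot).

# stackName: 1-64 characters drawn from A-Z, a-z, 0-9 and hyphen.
is_valid_stack_name() {
  name=$1
  case $name in
    ''|*[!A-Za-z0-9-]*) return 1 ;;
  esac
  [ "${#name}" -le 64 ]
}

# timeToLiveInMinutes: a whole number from 0 to 1440 (0 means no time limit).
is_valid_ttl() {
  ttl=$1
  case $ttl in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#ttl}" -le 4 ] && [ "$ttl" -le 1440 ]
}
```

For example, `is_valid_stack_name firstmodelspot && is_valid_ttl 30` succeeds, while a name containing an underscore or a time of 1441 is rejected.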
This will run for around 3 minutes. The training becomes viewable 5-7 minutes after this completes.
Once this script completes, two links will be printed to the console: one shows the visual training of the model and the other shows the training logs (e.g. 3.87.87.207:8080 and http://3.87.87.207:8100/deepracer-menu.html respectively). Paste these into your browser and wait 5-7 minutes for training to begin. On the visual training page, the link "/racecar/main_camera/zed/rgb/" will look most similar to the DeepRacer Console.
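
If you only have the instance's public IP to hand, the two links follow the pattern shown in the example above. A minimal sketch, assuming the ports and path stay as in that example; the function name training_urls is hypothetical.

```shell
# Hypothetical helper (not part of deepracer-on-the-spot): prints the
# training viewer URL and the menu/log URL for a given public IP,
# following the port and path pattern from the example in this README.
training_urls() {
  public_ip=$1
  printf 'http://%s:8080\n' "$public_ip"
  printf 'http://%s:8100/deepracer-menu.html\n' "$public_ip"
}
```

Calling `training_urls 3.87.87.207` prints the two example links from the paragraph above, one per line.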
create-standard-instance.sh creates a single on-demand EC2 instance. The instance type used is configured as the default in the standard-instance.yaml CloudFormation template file. create-spot-instance.sh creates an Auto Scaling group with a desired capacity of a single spot EC2 instance, if available. This is a fantastic way to save a lot of money on training DeepRacer models, as training on a g4dn.2xlarge spot instance can get you 4 workers at $0.22/hour (compared to $3.50/hour for 1 worker in the console). Note that deployment may fail if there aren't any spot instances of this size available. Procuring a spot instance is most likely to succeed outside of US working hours. If training is interrupted by a spot termination and new spot capacity becomes available within your defined 'timeToLiveInMinutes', a new spot instance will be created within the Auto Scaling group and training will continue from where it stopped. This will create additional files in your S3 bucket, because each subsequent training run will either append -1 to your folder names or, if the name already ends with a number, increment that number, for both your training (DR_LOCAL_S3_MODEL_PREFIX) and upload (DR_UPLOAD_S3_PREFIX) locations. If you struggle to get spot capacity you could deploy in another region, but if you do this you need to create an AMI (using scripts/create-image-builder.sh).
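
To make the folder renaming rule concrete, the sketch below reproduces it as described in this README: append -1 when the name does not end in a number, otherwise increment the trailing number. It is an illustration only, not the interruption handler's actual code, and the function name next_prefix is hypothetical. As a simplification it does not preserve zero padding in the trailing number.

```shell
# Hypothetical illustration of the renaming rule (not the real handler).
next_prefix() {
  name=$1
  # Strip trailing digits one at a time to find the non-numeric stem.
  stem=$name
  while :; do
    case $stem in
      *[0-9]) stem=${stem%?} ;;
      *) break ;;
    esac
  done
  # Whatever follows the stem is the trailing number (possibly empty).
  num=${name#"$stem"}
  if [ -z "$num" ]; then
    printf '%s-1\n' "$name"
  else
    # expr treats its operands as decimal, so a value such as 08 is safe.
    printf '%s%s\n' "$stem" "$(expr "$num" + 1)"
  fi
}
```

Under this rule `next_prefix firstmodelspot` prints `firstmodelspot-1`, and `next_prefix firstmodelspot-1` prints `firstmodelspot-2`, which is why several folders can accumulate in the bucket after repeated interruptions.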
This script can be executed many times (the DeepRacer console limits you to training a max of 4 concurrent models), with different instance stack names. All the different instances will share the base resources (EFS and S3). If you are using spot training and want to run this script multiple times concurrently, it is strongly recommended that you define unique locations for where your custom files are stored (DR_LOCAL_S3_CUSTOM_FILES_PREFIX). This is because the spot interruption handler updates run.env so that it can continue previous training; failing to use unique locations could result in the training of one model being overwritten with the previous progress of a different model being trained concurrently.
Both spot and standard instance requests are launched using a monthly refreshing AMI that is generated in a source AWS account to always grab the newest docker images for robomaker/sagemaker/coach. If you wish to run your own AMI, or run in a region other than us-east-1, use ./create-image-builder.sh to create the monthly refreshing pipeline and update your spot/standard instance bash scripts to use your AMI. NOTE: using your own AMI will incur a charge of ~$1/month because an EC2 instance will be created monthly to update the AMI.
You can also use the menu.sh to start training, modify config files, and run scripts.
Simply run:
./menu.sh
or python3 menu.py
The script stop-training.sh executes a 'safe termination' of training by updating the CloudFormation stack to gracefully complete the training 2 minutes after the script is run. This command works for both standard and spot instances. The script takes one parameter, the name of the stack used to create the instance (this is the same as the second parameter passed to create-standard-instance.sh or create-spot-instance.sh). For example: ./stop-training.sh my-instance-stack-name
You can also go to CloudFormation and manually delete the stack, but if you do that you won't get graceful termination (i.e. upload of the model to the DeepRacer Console or S3 upload to your 'upload' bucket).
The script add-access.sh checks that the IP address given as a parameter does not already exist in the network ACLs, then adds it to the security group ingress rules and adds a NACL entry. Use: ./add-access.sh <base resources stack name> <stack name> <IP address>
This is useful if you have multiple locations from which you want to monitor your training.
Subscribing email addresses to the 'spot instance interruption notification topic' (the topic is created by the base resources stack)
The script add-interruption-notification-subscription.sh adds an email address to the 'interruption notification topic.'
Use: ./add-interruption-notification-subscription.sh <base resources stack name> <stack name> <email address>
Note that it is also possible to create a subscription interactively in the SNS web console. Adding an email subscription results in an email containing a confirmation link being sent to that address. No published message is forwarded to the address until the user has confirmed the subscription (by clicking on the link in the original subscription notification email).
The script check-instances.sh provides a list showing the recent instances, their current status and PUBLIC_IP. It is an easy way to avoid having to track/save the PUBLIC_IP after starting each training (PUBLIC_IP:8100/deepracer-menu.html). Note that if you have recently executed stop-training.sh my-instance-stack-name, the my-instance-stack-name instance will be displayed as terminated.
Use: ./check-instances.sh
The script create-image-builder.sh creates an EC2 Image Builder Pipeline that creates a new AMI on the 1st of each month. The resources used to create the images include the community git repository content for deep racing. The drivers/containers are installed and the image is rebooted. This speeds up the instance creation, as the software is preinstalled. create-image-builder.sh takes two parameters, the resources stack name and a stack name for the image builder provisioned template. The resources created are defined in the image-builder.yaml template.
Before running this script please ensure you're using the latest version of the AWS CLI, as this is required to unblock public AMI sharing, otherwise you won't be able to use the image that is created for your training.
The image builder pipeline is invoked at midnight on the 1st of the month. To avoid waiting for the first AMI to be created, the pipeline can be invoked interactively after it has been created by the provisioned template. Alternatively, to avoid the costs of a monthly build (approx $1/region/month where you deploy Image Builder), after deployment you could modify the Image Builder pipeline to run manually in the console, and then trigger it when you want a new build (note: due to Docker cert issues AMIs must be less than 3 months old, otherwise containers don't start, so you should trigger a build at least every 3 months). Once you are using your own AMIs you must replace the standard AWS Account number (747447086422) in create-spot-instance.sh and create-standard-instance.sh with your own AWS Account number that hosts the AMIs, otherwise you'll likely get an error as no AMI will be found.
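
The account number swap described above can be done by hand in an editor, or with a small helper like the following. This is a minimal sketch in POSIX shell; the function name replace_ami_owner is hypothetical, and it assumes the standard account number appears literally in the two scripts, as this README states.

```shell
# Hypothetical helper (not part of deepracer-on-the-spot): replaces the
# standard AMI owner account number with your own in the given files.
replace_ami_owner() {
  new_account=$1
  shift
  # AWS account IDs are exactly 12 digits.
  case $new_account in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#new_account}" -eq 12 ] || return 1
  for f in "$@"; do
    tmp=$(mktemp) || return 1
    if sed "s/747447086422/$new_account/g" "$f" > "$tmp"; then
      # Write back through the original path so file permissions are kept.
      cat "$tmp" > "$f"
    else
      rm -f "$tmp"
      return 1
    fi
    rm -f "$tmp"
  done
}
```

A possible invocation is `replace_ami_owner 123456789012 create-spot-instance.sh create-standard-instance.sh`, where 123456789012 is a placeholder for your own account number.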
The image builder logs are written into the s3 bucket provided by the 'base resources'. The logs are subject to s3 lifecycle expiration.
Old created AMIs are deleted daily by the Image builder lifecycle policy.
To use, cd into the scripts directory and run ./create-image-builder.sh <base resources stack name> <stack name>
This script can be used to delete the resources created by the create-base-resources.sh script (and associated template). Please be aware that the resource deletion will fail if the S3 bucket created is not empty. delete-base-resources.sh takes a single mandatory parameter, the stack-name, same value as above.
The get-spot-prices.sh script checks the prices of g4dn, g5, g6 and g6e instances of sizes 2xlarge, 4xlarge and 8xlarge in every AWS region and returns the results. By default it does not filter the results and shows them ordered by PricePerWorkerHour (i.e. instance price per hour divided by the suggested number of workers the instance can host). Using the --help parameter will show supported values for the optional parameters. --sort_order changes the sort key to an alternative value, e.g. 'SpotPrice', and --interruption_filter allows you to filter the list, for example if you only want to see instances with an interruption frequency of '<5%'. To learn more about interruption frequency, visit the AWS Spot Instance Advisor.
To use, cd into the scripts directory and run ./get-spot-prices.sh --sort_order '<SORT_ORDER>' --interruption_filter '<INTERRUPTION_FILTER>'
This example will only show instances with less than a 5% chance of being interrupted, ordered by spot price: ./get-spot-prices.sh --sort_order 'SpotPrice' --interruption_filter '<5%'
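
PricePerWorkerHour is defined above as the instance price per hour divided by the suggested number of workers. The sketch below shows that arithmetic on its own; it is not taken from get-spot-prices.sh, and the function name price_per_worker_hour is hypothetical.

```shell
# Hypothetical helper (not part of get-spot-prices.sh): divides an hourly
# instance price by a worker count and prints the result to 4 decimals.
price_per_worker_hour() {
  awk -v price="$1" -v workers="$2" 'BEGIN {
    # A zero or negative worker count has no meaningful result.
    if (workers <= 0) exit 1
    printf "%.4f\n", price / workers
  }'
}
```

Using the figures quoted earlier in this README, `price_per_worker_hour 0.22 4` prints `0.0550` for a spot instance hosting 4 workers, and `price_per_worker_hour 3.50 1` prints `3.5000` for a single worker in the console.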
If you have an issue with training, the best first place to check is CloudFormation "Events" tab to see if there are any errors related to deploying your stack.
Issue | Description |
---|---|
The maximum number of network acl entries has been reached | |
Exception when checking for DEEPRACER_JOB_TYPE_ENV 'Local' is not valid DeepRacerJobType | |
S3 failed, retry count 1/5: An error occurred (404) when calling the HeadObject operation: Not Found | |
model imported to console says track is reInvent:2018 | |
Import model from console to DOTS | aws s3 cp "s3://aws-deepracer-assets-b9436ddf-db0a-4f63/my-deepracer-console-model-name/Mon, 17 Jul 2023 17:53:33 GMT/" "s3://my-base-bucket/my-new-model-name" --recursive DR_LOCAL_S3_PRETRAINED_PREFIX variable to the name of your model. |
How do I train continuous/SAC instead of discrete action space? | |
What if multiple people use the same AWS sandbox? | |
Quota increase hasn't been assigned | |
My model didn't appear in the console after training completed or import failed | |
I want to import into the AWS Console models from when my Spot instance was interrupted | |
My stack doesn't deploy and I get the error 'No export named base-DeepRacerServiceRole found' when trying to start my training | |
My stack doesn't deploy and I get null AMI error when trying to start my training | |
I can no longer access my training when it's running | |
The stack won't deploy due to a duplicate CloudWatch log name | |