Some questions surface repeatedly. This repository is an attempt to gather the most common ones and provide answers!
The solution was primarily designed for developers who use Linux and macOS. However, it's also possible to make it work on Windows.
Basic scenarios, which require only SSM without SSH, work on Windows without any additional configuration.
To connect from your local machine with SSH and start port forwarding with the scripts like sm-local-ssh-ide and sm-local-ssh-training, keep in mind that you need a Bash interpreter to execute them. They don't work in PowerShell.
We recommend obtaining Bash by installing the Git for Windows distribution. The next steps are:
1. Run the "Git Bash" application as Administrator.
2. Find the path where pip has installed your library:
$ cd ~/AppData/Local/Packages/PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0/LocalCache/local-packages/Python310/site-packages/sagemaker_ssh_helper
3. Execute sm-local-install-force once:
$ ./sm-local-install-force
4. Close Git Bash and start it again as a normal user.
Don't forget to repeat steps 1-4 after you install a new version of SageMaker SSH Helper.
The scripts like sm-local-ssh-ide and sm-local-ssh-training will now work from the Git Bash session under a regular user, and you may continue to work in your local IDE on Windows as usual.
Yes, the setup is similar to SageMaker Studio. Run SageMaker_SSH_Notebook.ipynb on the notebook instance and sm-local-ssh-notebook connect <<notebook-instance-name>> on your local machine.
Review the instructions for SageMaker Studio integration with PyCharm / VSCode for the rest of the details.
Indeed, when you run a SageMaker job, there are no EC2 instances or generic containers visible in the AWS console, because the instances and containers are managed by the SageMaker service. The trick that SageMaker SSH Helper uses is hybrid activations: SageMaker containers effectively become managed instances when the SSM agent starts, akin to on-premises instances. Managed instances have an ID that starts with the 'mi-' prefix, and once they connect to Systems Manager, you can see them in the AWS Console under Systems Manager -> Node Management -> Fleet Manager.
SageMaker Warm Pools is a built-in SageMaker Training feature which is great when you want to use the SageMaker API to:
- Run a series of relatively short training jobs, each job outputting a different model based on different input data (like a model per customer).
- Interactively iterate over a series of training jobs, changing code and hyperparameters between jobs. Job launch time will be less than 30 seconds. When using warm pools, all training jobs are audited and logged. Warm Pools is a built-in product feature, which you can use after you opt in.
You're billed as long as the warm pool hasn't expired.
SageMaker SSH Helper is a field solution for SageMaker, focused on interactive work, enabling use cases like:
- Shell access to the SageMaker training container to monitor and troubleshoot using OS tools.
- A remote development/debugging experience: use your IDE to code and run processes in the SageMaker container.
SSH Helper's interactive nature allows you to iterate in seconds by running multiple commands/experiments reusing one running training job. SSH Helper requires setting up your AWS account with IAM and SSM configuration. You're billed as long as the training job is running.
Start a SageMaker training job that runs a dummy training script which sleeps forever, then use remote development to carry out any activities on the training container. Note that this idea and the script train_placeholder.py are also introduced in the documentation in the section Remote code execution with PyCharm / VSCode over SSH.
Yes, it requires adding the same IAM permissions to the SageMaker role as described in IAM_SSM_Setup.md for your local role (section 3).
This logic is enforced by IAM policy. See the manual step 3 in IAM_SSM_Setup.md for a policy example.
It works as follows: SageMaker SSH Helper assigns, on behalf of the user, the tag SSHOwner with a value that equals the local user ID (see the source code for the SSH wrappers). For integration with SageMaker Studio, the user ID is passed in the notebook as the argument to the sm-ssh-ide init-ssm command.
When a user attempts to connect to an instance, IAM will authorize the user based on their ID and the value of the SSHOwner tag. The user will be denied access to the instance if the instance doesn't belong to them.
Another important part is the IAM policy with the ssm:AddTagsToResource action, described in the manual step 2. Limiting this action to the SageMaker role as a resource allows adding and updating tags only for newly created activations (instances), not for existing ones that may already belong to other users.
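For illustration only, a statement of this kind could look roughly like the sketch below. This is an assumption, not a copy of the policy from IAM_SSM_Setup.md, which remains the authoritative source; it combines the ssm:StartSession action with the ssm:resourceTag/SSHOwner condition key and the aws:userid policy variable:

```json
{
    "Effect": "Allow",
    "Action": "ssm:StartSession",
    "Resource": "arn:aws:ssm:*:*:managed-instance/mi-*",
    "Condition": {
        "StringEquals": {
            "ssm:resourceTag/SSHOwner": "${aws:userid}"
        }
    }
}
```

With such a condition, a session to an mi-* instance is allowed only when the SSHOwner tag value matches the caller's user ID.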
When the job fails, the container stops, and you're no longer able to connect to the container with SSM or SSH.
To troubleshoot this situation, use the trick with train_placeholder.py described in the README.md, and start your training script from the terminal or with the remote interpreter while the dummy script is running.
A variation of this solution is to create a wrapper script that executes your training script and sleeps forever after the training script exits. This way, after getting an error in CloudWatch, you will still be able to connect to the same container with SSM/SSH and troubleshoot its state.
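A minimal sketch of such a wrapper follows. The file name train_and_sleep.sh and the inner train.py are hypothetical placeholders for your own entry point:

```shell
# Create a hypothetical wrapper that runs the real training script and
# then keeps the container alive so you can still connect with SSM/SSH.
cat > train_and_sleep.sh <<'EOF'
#!/bin/bash
python train.py "$@"                  # your real training script
echo "Training exited with code $?."  # the exit code also lands in CloudWatch
sleep infinity                        # keep the container (and the job) running
EOF
chmod +x train_and_sleep.sh
```

Point your estimator at the wrapper instead of the training script; keep in mind the job keeps running (and billing) until you stop it manually.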
Yes, it's fine. They don't contain any of your local data. These are folders freshly created by the VNC server and the Xfce4 remote desktop environment. You will see them if you connect to SageMaker Studio with a VNC client after running the sm-local-ssh-ide command, as described in the IDE integration section of the documentation.
The public key is transferred to the container through the default SageMaker bucket, with an S3 URI that looks like s3://sagemaker-eu-west-1-555555555555/ssh-authorized-keys/.
If you want to change the location to your own bucket and path, export the variable like this:
export SSH_AUTHORIZED_KEYS_PATH=s3://DOC-EXAMPLE-BUCKET/ssh-public-keys-jane-doe/
sm-local-ssh-ide <<kernel_gateway_app_name>>
sm-local-ssh-training connect <<training_job_name>>
What if I want to train and deploy a model as a simple Estimator in my own container, without passing entry_point and source_dir?
In some cases, you don't want to pass any external scripts or libraries to the training job or the inference endpoint, and want to put everything into your own container at build time.
In this case, make sure that SageMaker SSH Helper is installed in your Dockerfile:
RUN pip --no-cache-dir install sagemaker-ssh-helper # <--NEW--
Important: Make sure that the version installed into the container matches the version of the library on your local machine.
The code for running estimators and inference will look like this:
from sagemaker.estimator import Estimator
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper, SSHModelWrapper  # <--NEW--

role = ...

estimator = Estimator(image_uri="555555555555.dkr.ecr.eu-west-1.amazonaws.com/byoc-ssh:latest",
                      role=role,
                      instance_count=1,
                      instance_type='ml.m5.xlarge',
                      max_run=60 * 30)

training_input = ...

ssh_wrapper = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=600)  # <--NEW--

estimator.fit({'training': training_input}, wait=False)

...

model = estimator.create_model()
ssh_model_wrapper = SSHModelWrapper.create(model, connection_wait_time_seconds=0)  # <--NEW--

endpoint_name = ...

predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m5.xlarge',
                         endpoint_name=endpoint_name,
                         wait=True)
What if I want to deploy a Multi Data Model without passing a reference to a Model object, only with image_uri?
In this case, you either manually pack the inference code into your model artifact during training or provide all the inference code inside the inference image.
Your code then should look like this:
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.pytorch import PyTorchPredictor
from sagemaker_ssh_helper.wrapper import SSHMultiModelWrapper  # <--NEW--

model_data_prefix = "s3://DOC-EXAMPLE-BUCKET/mms/"
repacked_model_data = ...
model_name = ...
model_role = ...
model_path = ...
endpoint_name = ...

mdm = MultiDataModel(
    name=model_name,
    model_data_prefix=model_data_prefix,
    image_uri='555555555555.dkr.ecr.eu-west-1.amazonaws.com/byoc:latest',
    role=model_role
)

ssh_wrapper = SSHMultiModelWrapper.create(mdm, connection_wait_time_seconds=0)  # <--NEW--

mdm.deploy(initial_instance_count=1,
           instance_type='ml.m5.xlarge',
           wait=True)

predictor = PyTorchPredictor(endpoint_name)

mdm.add_model(model_data_source=repacked_model_data, model_data_path=model_path)

predicted_value = predictor.predict(data=..., target_model=model_path)
Note: Your repacked model should also contain the SageMaker SSH Helper library files, and you need to import and start the helper from your inference script just as you do for normal inference.
See #7 for this request.
What if I want to use an estimator in a hyperparameter tuning job (HPO) and connect to a stuck training job with SSM?
In this case, wrapper.get_instance_ids() won't really work, because you don't call fit() directly on the estimator, and SSH Helper doesn't know which training job you are trying to connect to.
You should use extra lower-level APIs to fetch the name of the training job you are interested in first, and then either use SSMManager (recommended) or SSHLog (slower) to fetch the instance IDs from the code:
import time

from sagemaker.mxnet import MXNet
from sagemaker.tuner import HyperparameterTuner
from sagemaker_ssh_helper.manager import SSMManager
from sagemaker_ssh_helper.wrapper import SSHEstimatorWrapper

estimator = MXNet(...)
_ = SSHEstimatorWrapper.create(estimator, connection_wait_time_seconds=0)

objective_metric_name = ...
hyperparameter_ranges = ...
metric_definitions = ...

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            ...)

tuner.fit(wait=False)
time.sleep(15)  # allow training jobs to start

analytics = tuner.analytics()
training_jobs = analytics.training_job_summaries()
training_job_name = training_jobs[0]['TrainingJobName']

instance_ids = SSMManager().get_training_instance_ids(training_job_name, 300)
print(f'To connect over SSM run: aws ssm start-session --target {instance_ids[0]}')
Note: If you want to connect to a stuck training job from the command line with SSH, use the sm-local-ssh-training script, as for any other regular training job.
Define the SSH wrapper as usual, e.g.:
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchProcessor
from sagemaker_ssh_helper.wrapper import SSHProcessorWrapper

wait_time = ...
role = ...

boto3_session = boto3.session.Session(region_name='eu-west-2')  # <-- AWS Region override
sagemaker_session = sagemaker.Session(boto_session=boto3_session)

torch_processor = PyTorchProcessor(
    sagemaker_session=sagemaker_session,
    base_job_name='ssh-pytorch-processing',
    framework_version='1.9.1',
    py_version='py38',
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=60 * 60 * 3,
)

ssh_wrapper = SSHProcessorWrapper.create(torch_processor, connection_wait_time_seconds=wait_time)
When calling ssh_wrapper.get_instance_ids(), the region will be taken automatically from the wrapper. However, when you connect to the containers from the CLI with the aws ssm start-session command or with the sm-local-ssh-* commands, you need to redefine the AWS Region like this:
AWS_REGION=eu-west-2 AWS_DEFAULT_REGION=eu-west-2 \
sm-local-ssh-processing connect <<processing_job_name>>
We set both AWS_REGION and AWS_DEFAULT_REGION because, depending on your environment, either of these variables can already be set, and one can override the other.
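To see why both matter, here is a simplified, hypothetical sketch of one possible resolution order (not botocore's actual implementation): if you override only one variable, a stale value in the other can still win in tools that check them in a different order.

```python
def resolve_region(env):
    # Hypothetical lookup order: AWS_REGION first, then AWS_DEFAULT_REGION.
    # Some tools check these variables in the opposite order, which is why
    # this FAQ recommends overriding both.
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")

stale_env = {"AWS_REGION": "eu-west-1", "AWS_DEFAULT_REGION": "eu-west-1"}

# Overriding only one variable leaves the stale one in place:
partial = dict(stale_env, AWS_DEFAULT_REGION="eu-west-2")
print(resolve_region(partial))  # eu-west-1: the stale AWS_REGION wins here

# Overriding both is unambiguous regardless of the lookup order:
full = dict(stale_env, AWS_REGION="eu-west-2", AWS_DEFAULT_REGION="eu-west-2")
print(resolve_region(full))  # eu-west-2
```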
You can control AWS CLI settings with environment variables. In particular, this is how to select the AWS CLI profile with the sm-local-ssh-* tools:
AWS_PROFILE=<<profile_name>> sm-local-ssh-ide <<kernel_gateway_app_name>>
There are plenty of methods already available for you to automate everything. Take a look at the end-to-end automated tests as an example.
The get_instance_ids() method is already mentioned in the documentation. The underlying automation methods are available in the SSMManager class and the SSHLog class.
Also check the method start_ssm_connection_and_continue() from the SSHEnvironmentWrapper class: it automates creating the SSH tunnel, running remote commands, stopping the waiting loop, and disconnecting gracefully. The underlying implementation is in the SSMProxy class.
I’m getting an API throttling error in the logs: An error occurred (ThrottlingException) when calling the CreateActivation operation (reached max retries: 4): Rate exceeded
This error happens when too many instances try to register with SSM at the same time, which is likely when you run a SageMaker training job with multiple instances.
As a workaround for a SageMaker training job, connect to any of the nodes that successfully registered with SSM (say, "algo-1"); from there you can hop over to the other nodes with the existing passwordless SSH.
You could also submit an AWS Support ticket to increase the API rate limit, but for the reason stated above, we don't think that's needed.
Log in to the container and run:
tail -f /var/log/amazon/ssm/amazon-ssm-agent.log
How can I clean up Systems Manager after receiving ERROR Registration failed due to error registering the instance with AWS SSM. RegistrationLimitExceeded: Registration limit of 20000 reached for SSM On-prem managed instances.
SageMaker containers are transient in nature. SageMaker SSH Helper registers each container with SSM as a "managed instance". Currently, there's no built-in mechanism to deregister instances when a job completes, so the accumulated registrations can eventually hit the SSM registration limit. To resolve this, consider cleaning up stale SageMaker SSH Helper-related registrations, either manually via the UI or with deregister_old_instances_from_ssm.py.
WARNING: be careful NOT to deregister managed instances that are unrelated to SageMaker SSH Helper. deregister_old_instances_from_ssm.py includes a number of filters to deregister only the relevant managed instances. It's recommended to review the currently registered managed instances in the AWS Console Fleet Manager before actually removing them.
Deregistering requires administrator / power user IAM privileges.
There's a big delay between getting the mi-* instance ID and until I can successfully start a session to the container.
This can happen if SSM API throttling takes place during instance initialization. In such a case, once you are able to shell into the container, you can identify the issue by grepping for this printout from the SSM agent initialization:
grep Throttling /var/log/amazon/ssm/amazon-ssm-agent.log
2022-12-15 12:37:17 INFO [ssm-agent-worker] Entering SSM Agent hibernate - ThrottlingException: Rate exceeded status code: 400, request id: 56ae2c79-bb35-4903-ab49-59cf9e131aca
You should submit an AWS Support ticket to identify the relevant API limit and increase it.
An error message looks like this:
An error occurred (BadRequest) when calling the StartSession operation: Enable advanced-instances tier to use Session Manager with your on-premises instances
First, check that your instance shows up as an advanced instance in Fleet Manager. If it doesn't show up there, you've probably missed the manual step 1 in IAM_SSM_Setup.md.
Also check that you're connecting from the same AWS region. Run the following command on your local machine and check that the region is the same as in your AWS console:
aws configure list | grep region