ENVIRONMENT_VARIABLES.md

Environment variables

SM_MODEL_DIR

SM_MODEL_DIR=/opt/ml/model

When the training job finishes, the container and its file system will be deleted, with the exception of the /opt/ml/model and /opt/ml/output directories. Use /opt/ml/model to save the model checkpoints. These checkpoints will be uploaded to the default S3 bucket.

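import argparse
import os

from chainer import serializers

# using it in argparse (an optional flag, so the default applies when it is omitted)
parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])

# using it as variable
model_dir = os.environ['SM_MODEL_DIR']

# saving checkpoints to model dir in chainer (model is your trained chainer model)
serializers.save_npz(os.path.join(model_dir, 'model.npz'), model)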

For more information, see: How Amazon SageMaker Processes Training Output.
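
For scripts that should also run outside a SageMaker container, a minimal framework-independent sketch of writing an artifact into the model directory might look like the following. The function name save_artifact, the default file name metadata.json, and the temporary-directory fallback are assumptions made for illustration; they are not part of the toolkit.

```python
import json
import os
import tempfile


def save_artifact(payload, filename="metadata.json", environ=os.environ):
    """Write payload as JSON into SM_MODEL_DIR and return the file path.

    Hypothetical helper: when SM_MODEL_DIR is not set (for example when
    running locally), a fresh temporary directory is used instead.
    """
    model_dir = environ.get("SM_MODEL_DIR") or tempfile.mkdtemp()
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    with open(path, "w") as f:
        json.dump(payload, f)
    return path
```

Passing the environment as a parameter keeps the helper easy to exercise in tests without touching the real process environment.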

SM_CHANNELS

SM_CHANNELS='["testing","training"]'

Contains the list of input data channels in the container.

When you run training, you can partition your training data into different logical "channels". Depending on your problem, common channel names include "training", "testing", and "evaluation", or "images" and "labels".

SM_CHANNELS lists the names of the channels available in the container as a JSON-encoded list.

import argparse
import json
import os

# using it in argparse (argparse applies type to a string default)
parser = argparse.ArgumentParser()
parser.add_argument('--channel_names', type=json.loads, default=os.environ['SM_CHANNELS'])

# using it as variable
channel_names = json.loads(os.environ['SM_CHANNELS'])

For more information, see: Channel.

SM_CHANNEL_{channel_name}

SM_CHANNEL_TRAINING='/opt/ml/input/data/training'
SM_CHANNEL_TESTING='/opt/ml/input/data/testing'

Contains the directory where the channel named channel_name is located in the container.

import argparse
import os

import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TESTING'])

args = parser.parse_args()

train_file = np.load(os.path.join(args.train, 'train.npz'))
test_file = np.load(os.path.join(args.test, 'test.npz'))
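
SM_CHANNELS and the per-channel variables can be combined to discover every channel directory without hard-coding names. The sketch below assumes the variable suffix is the channel name in upper case, as in the SM_CHANNEL_TRAINING and SM_CHANNEL_TESTING examples; the function name channel_dirs is hypothetical.

```python
import json
import os


def channel_dirs(environ=os.environ):
    """Map each channel name in SM_CHANNELS to its SM_CHANNEL_{NAME} directory."""
    names = json.loads(environ.get("SM_CHANNELS", "[]"))
    dirs = {}
    for name in names:
        # Assumption: the suffix is the channel name upper-cased,
        # matching the SM_CHANNEL_TRAINING / SM_CHANNEL_TESTING examples.
        key = "SM_CHANNEL_" + name.upper()
        if key in environ:
            dirs[name] = environ[key]
    return dirs
```

Channels listed in SM_CHANNELS without a matching directory variable are skipped rather than raising, which may or may not be the behavior you want.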

SM_HPS

SM_HPS='{"batch-size": "256", "learning-rate": "0.0001","communicator": "pure_nccl"}'

Contains a JSON-encoded dictionary with the user-provided hyperparameters.

import os
import json

hyperparameters = json.loads(os.environ['SM_HPS'])
# {"batch-size": "256", "learning-rate": "0.0001", "communicator": "pure_nccl"}
# (the values in the example above are JSON strings, so they load as str)
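
In the SM_HPS example above, the numeric values are encoded as strings. If your script needs numbers, one possible approach is to try decoding each string value as JSON and keep it unchanged when that fails. This is a sketch under that assumption; load_hyperparameters is a hypothetical helper, not a toolkit function.

```python
import json
import os


def load_hyperparameters(environ=os.environ):
    """Load SM_HPS and convert string values such as "256" to numbers."""
    raw = json.loads(environ.get("SM_HPS", "{}"))
    parsed = {}
    for name, value in raw.items():
        if isinstance(value, str):
            try:
                # "256" -> 256, "0.0001" -> 0.0001, "true" -> True
                value = json.loads(value)
            except ValueError:
                pass  # not valid JSON (e.g. "pure_nccl"): keep the string
        parsed[name] = value
    return parsed
```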

SM_HP_{hyperparameter_name}

SM_HP_LEARNING-RATE=0.0001
SM_HP_BATCH-SIZE=10000
SM_HP_COMMUNICATOR=pure_nccl

Contains the value of the hyperparameter named hyperparameter_name.

import os

learning_rate = float(os.environ['SM_HP_LEARNING-RATE'])
batch_size = int(os.environ['SM_HP_BATCH-SIZE'])
communicator = os.environ['SM_HP_COMMUNICATOR']

SM_CURRENT_HOST

SM_CURRENT_HOST=algo-1

The name of the current container on the container network.

import argparse
import os

# using it in argparse
parser = argparse.ArgumentParser()
parser.add_argument('--current_host', type=str, default=os.environ['SM_CURRENT_HOST'])

# using it as variable
current_host = os.environ['SM_CURRENT_HOST']

SM_HOSTS

SM_HOSTS='["algo-1","algo-2"]'

JSON-encoded list containing all the hosts.

import argparse
import json
import os

# using it in argparse (argparse applies type to a string default)
parser = argparse.ArgumentParser()
parser.add_argument('--hosts', type=json.loads, default=os.environ['SM_HOSTS'])

# using it as variable
hosts = json.loads(os.environ['SM_HOSTS'])
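
In distributed training, SM_HOSTS and SM_CURRENT_HOST are often combined to derive a rank and pick a leader. The sketch below uses one common convention (sort the host names, treat the first as leader); that convention and the name cluster_position are assumptions, not something the toolkit prescribes.

```python
import json
import os


def cluster_position(environ=os.environ):
    """Derive rank, world size, and leader from SM_HOSTS / SM_CURRENT_HOST."""
    # Sorting gives every host the same ordering regardless of list order.
    hosts = sorted(json.loads(environ["SM_HOSTS"]))
    current = environ["SM_CURRENT_HOST"]
    rank = hosts.index(current)  # raises ValueError if the host is not listed
    return {
        "rank": rank,
        "world_size": len(hosts),
        "leader": hosts[0],
        "is_leader": rank == 0,
    }
```

Note that the sort is lexicographic, so "algo-10" orders before "algo-2"; this is harmless as long as every host applies the same rule.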

SM_NUM_GPUS

SM_NUM_GPUS=1

The number of GPUs available in the current container.

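import argparse
import os

# using it in argparse (argparse converts the string default with type=int)
parser = argparse.ArgumentParser()
parser.add_argument('--num_gpus', type=int, default=os.environ['SM_NUM_GPUS'])

# using it as variable
num_gpus = int(os.environ['SM_NUM_GPUS'])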

SM_NUM_NEURONS

SM_NUM_NEURONS=1

The number of Neuron Cores available in the current container.

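import argparse
import os

# using it in argparse (argparse converts the string default with type=int)
parser = argparse.ArgumentParser()
parser.add_argument('--num_neurons', type=int, default=os.environ['SM_NUM_NEURONS'])

# using it as variable
num_neurons = int(os.environ['SM_NUM_NEURONS'])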

SM_NUM_CPUS

SM_NUM_CPUS=32

The number of CPUs available in the current container.

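import argparse
import os

# using it in argparse (argparse converts the string default with type=int)
parser = argparse.ArgumentParser()
parser.add_argument('--num_cpus', type=int, default=os.environ['SM_NUM_CPUS'])

# using it as variable
num_cpus = int(os.environ['SM_NUM_CPUS'])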
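
The three resource counts (SM_NUM_GPUS, SM_NUM_NEURONS, SM_NUM_CPUS) can be read together to choose a device type and a worker count. The following is only a sketch: the function name compute_resources, the device labels, and the rule of reserving one CPU for the main process are illustrative assumptions.

```python
import os


def compute_resources(environ=os.environ, reserve_cpus=1):
    """Summarize available resources from the SM_NUM_* variables."""
    cpus = int(environ.get("SM_NUM_CPUS", os.cpu_count() or 1))
    gpus = int(environ.get("SM_NUM_GPUS", 0))
    neurons = int(environ.get("SM_NUM_NEURONS", 0))

    # Hypothetical preference order: GPU first, then Neuron cores, then CPU.
    if gpus > 0:
        device = "gpu"
    elif neurons > 0:
        device = "neuron"
    else:
        device = "cpu"

    # Leave reserve_cpus for the main process, but always keep one worker.
    workers = max(1, cpus - reserve_cpus)
    return {"device": device, "cpus": cpus, "gpus": gpus,
            "neurons": neurons, "workers": workers}
```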

SM_LOG_LEVEL

SM_LOG_LEVEL=20

The current log level in the container.

import os
import logging

logger = logging.getLogger(__name__)

logger.setLevel(int(os.environ.get('SM_LOG_LEVEL', logging.INFO)))

SM_NETWORK_INTERFACE_NAME

SM_NETWORK_INTERFACE_NAME=ethwe

Name of the network interface. (Useful for distributed training.)

import argparse
import os

# using it in argparse
parser = argparse.ArgumentParser()
parser.add_argument('--network_interface', type=str, default=os.environ['SM_NETWORK_INTERFACE_NAME'])

# using it as variable
network_interface = os.environ['SM_NETWORK_INTERFACE_NAME']

SM_USER_ARGS

SM_USER_ARGS='["--batch-size","256","--learning-rate","0.0001","--communicator","pure_nccl"]'

JSON-encoded list with the script arguments provided for training.
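
Because SM_USER_ARGS is a JSON list of command-line tokens, it can be decoded and handed straight to argparse. The sketch below is one way to do that; the function name parse_user_args and the default values are hypothetical, and the learning-rate flag is accepted in both hyphen and underscore spellings.

```python
import argparse
import json
import os


def parse_user_args(environ=os.environ):
    """Decode SM_USER_ARGS and parse it like a normal argv list."""
    argv = json.loads(environ.get("SM_USER_ARGS", "[]"))

    parser = argparse.ArgumentParser()
    # Defaults below are illustrative placeholders, not toolkit defaults.
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", "--learning_rate",
                        dest="learning_rate", type=float, default=0.001)
    parser.add_argument("--communicator", type=str, default="naive")

    # parse_known_args tolerates extra arguments the script does not declare.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown
```

Using parse_known_args instead of parse_args means unrecognized arguments are returned in the second value rather than aborting the script.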

SM_INPUT_DIR

SM_INPUT_DIR=/opt/ml/input/

The path of the input directory, e.g. /opt/ml/input/. The input directory is the directory where SageMaker saves input data and configuration files before and during training.

SM_INPUT_CONFIG_DIR

SM_INPUT_CONFIG_DIR=/opt/ml/input/config

The directory where standard SageMaker configuration files are located, e.g. /opt/ml/input/config/.

SageMaker training creates the following files in this folder when training starts:

  • hyperparameters.json: Amazon SageMaker makes the hyperparameters in a CreateTrainingJob request available in this file.
  • inputdataconfig.json: You specify data channel information in the InputDataConfig parameter in a CreateTrainingJob request. Amazon SageMaker makes this information available in this file.
  • resourceconfig.json: The name of the current host and the names of all host containers in the training job.

For more information about these files, see: How Amazon SageMaker Provides Training Information.
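
The three files listed above are plain JSON, so they can be read directly from the configuration directory. A minimal sketch, assuming the file names given above and falling back to /opt/ml/input/config when SM_INPUT_CONFIG_DIR is unset; load_input_config and the result keys are hypothetical names.

```python
import json
import os

# File names as listed above; the dictionary keys are illustrative.
CONFIG_FILES = {
    "hyperparameters": "hyperparameters.json",
    "input_data_config": "inputdataconfig.json",
    "resource_config": "resourceconfig.json",
}


def load_input_config(config_dir=None):
    """Load whichever SageMaker config files exist in config_dir."""
    if config_dir is None:
        config_dir = os.environ.get("SM_INPUT_CONFIG_DIR", "/opt/ml/input/config")
    loaded = {}
    for key, filename in CONFIG_FILES.items():
        path = os.path.join(config_dir, filename)
        if os.path.isfile(path):  # skip files that are not present
            with open(path) as f:
                loaded[key] = json.load(f)
    return loaded
```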

SM_RESOURCE_CONFIG

SM_RESOURCE_CONFIG='{"current_host":"algo-1","hosts":["algo-1","algo-2"]}'

The contents of /opt/ml/input/config/resourceconfig.json. It has the following keys:

  • current_host: The name of the current container on the container network. For example, 'algo-1'.
  • hosts: The list of names of all containers on the container network, sorted lexicographically. For example, ['algo-1', 'algo-2', 'algo-3'] for a three-node cluster.

For more information about resourceconfig.json, see: Distributed Training Configuration.

SM_INPUT_DATA_CONFIG

SM_INPUT_DATA_CONFIG='{
    "testing": {
        "RecordWrapperType": "None",
        "S3DistributionType": "FullyReplicated",
        "TrainingInputMode": "File"
    },
    "training": {
        "RecordWrapperType": "None",
        "S3DistributionType": "FullyReplicated",
        "TrainingInputMode": "File"
    }
}'

Input data configuration from /opt/ml/input/config/inputdataconfig.json.

For more information about inputdataconfig.json, see: Input Data Configuration.
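
As an illustration of working with this value, the sketch below groups channel names by their TrainingInputMode, which can be useful when a script handles "File" channels differently from others. The function name channels_by_input_mode is hypothetical.

```python
import json
import os


def channels_by_input_mode(environ=os.environ):
    """Group channel names from SM_INPUT_DATA_CONFIG by TrainingInputMode."""
    config = json.loads(environ.get("SM_INPUT_DATA_CONFIG", "{}"))
    modes = {}
    for channel, settings in config.items():
        mode = settings.get("TrainingInputMode")  # None if the key is missing
        modes.setdefault(mode, []).append(channel)
    # Sort names so the result does not depend on JSON key order.
    return {mode: sorted(names) for mode, names in modes.items()}
```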

SM_TRAINING_ENV

SM_TRAINING_ENV='
{
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/testing",
        "train": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_chainer_container.training:main",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "hyperparameters": {
        "batch-size": 10000,
        "epochs": 1
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "RecordWrapperType": "None",
            "S3DistributionType": "FullyReplicated",
            "TrainingInputMode": "File"
        },
        "train": {
            "RecordWrapperType": "None",
            "S3DistributionType": "FullyReplicated",
            "TrainingInputMode": "File"
        }
    },
    "input_dir": "/opt/ml/input",
    "job_name": "preprod-chainer-2018-05-31-06-27-15-511",
    "log_level": 20,
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-{aws-region}-{aws-id}/{training-job-name}/source/sourcedir.tar.gz",
    "module_name": "user_script",
    "network_interface_name": "ethwe",
    "num_cpus": 4,
    "num_gpus": 1,
    "num_neurons": 1,
    "output_data_dir": "/opt/ml/output/data/algo-1",
    "output_dir": "/opt/ml/output",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1",
            "algo-2"
        ]
    }
}'

Provides all the training information as a JSON-encoded dictionary.
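
Since SM_TRAINING_ENV bundles everything into one dictionary, a script can decode it once and pick out the fields it needs. The sketch below selects a few of the keys shown in the example above; the function name training_summary and the choice of keys are assumptions for illustration.

```python
import json
import os


def training_summary(environ=os.environ):
    """Decode SM_TRAINING_ENV and return a few commonly used fields."""
    env = json.loads(environ["SM_TRAINING_ENV"])
    return {
        "job_name": env.get("job_name"),
        "current_host": env.get("current_host"),
        "hosts": env.get("hosts", []),
        "model_dir": env.get("model_dir"),
        "channel_input_dirs": env.get("channel_input_dirs", {}),
        "hyperparameters": env.get("hyperparameters", {}),
    }
```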