TonY Configurations
Junfan Zhang edited this page Feb 6, 2022 · 26 revisions
Name | Default | Meaning |
---|---|---|
tony.other.namenodes | | Namenode URIs to get delegation tokens from. |
tony.yarn.queue | default | Default queue to submit to YARN. |
tony.application.name | TensorFlowApplication | Name of your YARN application. |
tony.application.node-label | | YARN partition which this application should run in. |
tony.application.single-node | false | Whether this is single node training or not. |
tony.application.enable-preprocess | false | Whether the AM should invoke the user's python script or not. |
tony.application.timeout | 0 | Max runtime of the application before killing it, in milliseconds. |
tony.application.prepare-stage | n/a | Comma-separated list of task types that TonY will track and wait for to finish before starting tasks in the training stage. Task types that are defined in tony.application.untracked.jobtypes will not be tracked here. |
tony.application.training-stage | n/a | Comma-separated list of task types that TonY won't start until task types in prepare-stage have finished (with the exception of tony.application.untracked.jobtypes). |
tony.application.untracked.jobtypes | ps | Comma-separated list of task types that TonY will not track and wait for to finish. Once all other task types have finished, the TonY AM will exit. |
tony.containers.resources | n/a | Comma-delimited list of resources to be localized to all containers. If a resource has no scheme such as hdfs:// or s3://, it is treated as a local file. If an entry carries the #archive annotation, the file is automatically unzipped when localized to the containers, into a folder with the same name as the file. For example, /user/khu/abc.zip#archive is inferred to be a local file and is unarchived in the containers, so you can expect an abc.zip/ folder in your container's working directory. The :: notation, available since TonY 0.3.3, renames a file on localization: with PATH/TO/abc.txt::def.txt, the abc.txt file is localized as def.txt in the container working directory. |
tony.containers.envs | n/a | Container environment setup, applied before TonY starts. |
tony.execution.envs | n/a | Shell environment setup, applied before the actual job starts. |
tony.application.stop-on-failure-jobtypes | ps | Comma-separated list of task types for which the failure of any job instance makes TonY short-circuit the training and immediately return failure. |
tony.application.fail-on-worker-failure-enabled | false | Whether a worker failure fails the training. When false, TonY can still report successful training even if some workers failed. Setting this to true short-circuits and immediately stops training if a worker fails. |
tony.application.hadoop.location | n/a | URI of the Hadoop dependency archive, in the format [scheme]://[host][path]#[fragment]. For example, hdfs://hostname/mapred/framework/hadoop-mapreduce-3.1.2.tar#mrframework. The fragment is optional and is used as the alias of the localized file in the YARN container. For example, hdfs://ltx1-1234/mapred/framework/hadoop-mapreduce-3.1.2.tar#mrframework will be localized as mrframework rather than hadoop-mapreduce-3.1.2.tar. If this field is not configured, TonY looks up the archive defined in mapreduce.application.framework.path for the Hadoop dependency. If mapreduce.application.framework.path is not configured in mapred-site.xml, TonY uses the Hadoop dependency configured by YARN. |
tony.application.hadoop.classpath | n/a | Classpath for using the Hadoop archive configured by tony.application.hadoop.location. Classpath entries are separated by commas. If this field is not configured, TonY uses the classpath defined in mapreduce.application.classpath. If mapreduce.application.classpath is not configured in mapred-site.xml, TonY uses the classpath configured by YARN. |
tony.application.group.X | n/a | Group of task types that other tasks can depend on. Group members are separated by commas, for example tony.application.group.A=chief,worker. If this field is not configured, TonY will not use tony.application.dependency.Y.timeout.after.X. |
tony.application.dependency.Y.timeout.after.X | n/a | Timeout, in seconds, applied to tasks of job type Y once all jobs in the dependent group X have finished. If the timeout is reached, TonY fails the job. For example, because of a TensorFlow bug, workers sometimes hang after the chief has finished. With tony.application.group.A = chief and tony.application.dependency.worker.timeout.after.A = 3600, the workers stay alive for at most 3600 seconds after the chief has finished. The two settings must be used in combination. |
tony.application.dependency.Y.timeout.after.X.ignored | false | Whether the training job will exit once the dependency times out. If true, the task role Y is marked as untracked. |
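
The group and dependency-timeout settings in the table above only take effect as a pair. A minimal sketch using the values quoted in the table (a group A that contains only the chief, and a 3600-second timeout for workers); it illustrates the shape of the configuration and is not a recommended setting:

```xml
<configuration>
  <!-- Group "A" consists of the chief task type. -->
  <property>
    <name>tony.application.group.A</name>
    <value>chief</value>
  </property>
  <!-- Workers may keep running for at most 3600 seconds after all jobs in group A have finished. -->
  <property>
    <name>tony.application.dependency.worker.timeout.after.A</name>
    <value>3600</value>
  </property>
</configuration>
```
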
Name | Default | Meaning |
---|---|---|
tony.task.max-total-instances | -1 | Maximum number of tasks (of all types) that can be requested. -1 means no limit. |
tony.task.max-total-gpus | -1 | Maximum number of GPUs that can be requested across all task types. -1 means no limit. |
tony.task.executor.jvm.opts | -Xmx1536m | JVM options for each TaskExecutor. |
tony.task.heartbeat-interval-ms | 1000 | Interval, in milliseconds, at which TaskExecutors send heartbeats to the AM. |
tony.task.max-missed-heartbeats | 25 | Number of missed heartbeats after which a TaskExecutor is declared dead. |
tony.task.metrics-interval-ms | 5000 | Frequency, in milliseconds, at which TaskExecutors will report metrics to the AM. |
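
As a rough reading of the heartbeat settings above, the time before a TaskExecutor is declared dead is about the heartbeat interval multiplied by the allowed number of missed heartbeats, which comes to roughly 25 seconds with the defaults (1000 ms × 25). This reading is an assumption; the exact detection timing depends on how the AM checks heartbeats. The sketch below uses illustrative values that stretch the window to about 60 seconds:

```xml
<configuration>
  <!-- Illustrative value: heartbeat every 2 seconds instead of every second. -->
  <property>
    <name>tony.task.heartbeat-interval-ms</name>
    <value>2000</value>
  </property>
  <!-- Illustrative value: tolerate 30 missed heartbeats, about 60 seconds in total. -->
  <property>
    <name>tony.task.max-missed-heartbeats</name>
    <value>30</value>
  </property>
</configuration>
```
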
Name | Default | Meaning |
---|---|---|
tony.am.retry-count | 0 | How many times a failed AM should retry. On retry, all tasks (workers, ps, etc.) will be relaunched. |
tony.am.memory | 2g | AM memory size, requested as a string (e.g. '2g' or '2048m'). |
tony.am.vcores | 1 | Number of AM vcores to use. |
tony.am.gpus | 0 | Number of AM GPUs to use. (In general, should only be applicable in single node mode.) |
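
A minimal sketch of overriding the AM settings above. The values are illustrative assumptions chosen to show the format, not recommendations:

```xml
<configuration>
  <!-- Retry a failed AM once; on retry, all tasks are relaunched. -->
  <property>
    <name>tony.am.retry-count</name>
    <value>1</value>
  </property>
  <!-- Memory is given as a string with a unit suffix. -->
  <property>
    <name>tony.am.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.am.vcores</name>
    <value>2</value>
  </property>
</configuration>
```
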
Name | Default | Meaning |
---|---|---|
tony.X.instances | 1 | Number of tasks for TensorFlow job "X". Defaults to 1 if X=ps or X=worker, 0 otherwise. |
tony.X.memory | 2g | Memory size per task in TensorFlow job "X", requested as a string (e.g. '2g' or '2048m'). |
tony.X.vcores | 1 | Number of vcores per task in TensorFlow job "X". |
tony.X.gpus | 0 | Number of GPUs per task in TensorFlow job "X". |
tony.X.resources | n/a | Comma-delimited list of resources to be localized to all containers running the "X" job type. If a resource has no scheme such as hdfs:// or s3://, it is treated as a local file. If an entry carries the #archive annotation, the file is automatically unzipped when localized to the containers, into a folder with the same name as the file. For example, /user/khu/abc.zip#archive is inferred to be a local file and is unarchived in the containers, so you can expect an abc.zip/ folder in your container's working directory. The :: notation, available since TonY 0.3.3, renames a file on localization: with PATH/TO/abc.txt::def.txt, the abc.txt file is localized as def.txt in the container working directory. |
tony.X.command | tony.containers.command | The command to run for the task type "X". By default, all tasks run the same command, tony.containers.command, which is set automatically by TonyClient if you specify an -executes argument. |
TonY determines which TensorFlow job types to allocate based on configurations of the form "tony.X.instances". For each job "X", it will also search for resource requests corresponding to this TensorFlow job.
For example, you can configure a ps, worker, and chief job via:
<configuration>
<property>
<name>tony.worker.instances</name>
<value>4</value>
</property>
<property>
<name>tony.worker.memory</name>
<value>4g</value>
</property>
<property>
<name>tony.worker.gpus</name>
<value>1</value>
</property>
<property>
<name>tony.worker.resources</name>
<value>hdfs://namenode:9000/user/tony/hello.py</value>
</property>
<property>
<name>tony.ps.memory</name>
<value>3g</value>
</property>
<property>
<name>tony.chief.instances</name>
<value>1</value>
</property>
<property>
<name>tony.chief.memory</name>
<value>6g</value>
</property>
<property>
<name>tony.chief.gpus</name>
<value>1</value>
</property>
</configuration>
Note that by default TonY configures one ps and one worker and no other TensorFlow jobs. In this example, four workers are allocated because "tony.worker.instances" is set explicitly, and one ps is allocated because "tony.ps.instances" is omitted. TonY also configures one chief task because "tony.chief.instances" is set to 1, and this task is allocated 6 GB of memory and 1 GPU.
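
The per-job-type tony.X.resources and tony.X.command settings can be combined in the same way. The sketch below reuses the resource entries quoted in the table (PATH/TO is the table's own placeholder, not a real path); the command value is a hypothetical example that assumes hello.py ends up in the container working directory:

```xml
<configuration>
  <!-- Three entries: a file on HDFS, a local zip that is unarchived in the container,
       and a local file that is renamed to def.txt on localization. -->
  <property>
    <name>tony.worker.resources</name>
    <value>hdfs://namenode:9000/user/tony/hello.py,/user/khu/abc.zip#archive,PATH/TO/abc.txt::def.txt</value>
  </property>
  <!-- Hypothetical command for workers only; other task types keep tony.containers.command. -->
  <property>
    <name>tony.worker.command</name>
    <value>python hello.py</value>
  </property>
</configuration>
```
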
Name | Default | Meaning |
---|---|---|
tony.application.security.enabled | true | Whether this application is running in a Kerberized grid. Setting this to true will fetch tokens from the cluster as well as between the client and AM. |
tony.application.hdfs-conf-path | | Path to HDFS configuration, to be passed as an environment variable to the python training scripts. |
tony.application.yarn-conf-path | | Path to YARN configuration, to be passed as an environment variable to the python training scripts. |
tony.application.mapred-conf-path | | Path to MapReduce configuration, to be passed as an environment variable to the python training scripts. |
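
Because tony.application.security.enabled defaults to true, a cluster that is not Kerberized would presumably need it switched off. A minimal sketch under that assumption; the HDFS configuration path is a hypothetical placeholder, and the table does not say whether a file or a directory is expected:

```xml
<configuration>
  <!-- Assumes the cluster is not Kerberized, so no tokens need to be fetched. -->
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
  <!-- Hypothetical path; replace with the location used on your cluster. -->
  <property>
    <name>tony.application.hdfs-conf-path</name>
    <value>/etc/hadoop/conf/hdfs-site.xml</value>
  </property>
</configuration>
```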