This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Commit af52f5f

Merge branch 'master' into master
EndingCredits authored Aug 1, 2017
2 parents 2ced78d + cd222d3 commit af52f5f
Showing 99 changed files with 5,990 additions and 2,054 deletions.
49 changes: 37 additions & 12 deletions README.md
@@ -23,8 +23,25 @@ send along a pull request to add your dataset or model.
See [our contribution
doc](CONTRIBUTING.md) for details and our [open
issues](https://github.com/tensorflow/tensor2tensor/issues).
And chat with us and other users on
[Gitter](https://gitter.im/tensor2tensor/Lobby).
You can chat with us and other users on
[Gitter](https://gitter.im/tensor2tensor/Lobby) and please join our
[Google Group](https://groups.google.com/forum/#!forum/tensor2tensor) to keep up
with T2T announcements.

Here is a one-command version that installs tensor2tensor, downloads the data,
trains an English-German translation model, and lets you use it interactively:
```
pip install tensor2tensor && t2t-trainer \
--generate_data \
--data_dir=~/t2t_data \
--problems=wmt_ende_tokens_32k \
--model=transformer \
--hparams_set=transformer_base_single_gpu \
--output_dir=~/t2t_train/base \
--decode_interactive
```

See the [Walkthrough](#walkthrough) below for more details on each step.

### Contents

@@ -69,11 +86,8 @@ mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
t2t-datagen \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR \
--num_shards=100 \
--problem=$PROBLEM
cp $TMP_DIR/tokens.vocab.* $DATA_DIR
# Train
# * If you run out of memory, add --hparams='batch_size=2048' or even 1024.
t2t-trainer \
@@ -153,7 +167,7 @@ python -c "from tensor2tensor.models.transformer import Transformer"
specification.
* Support for multi-GPU machines and synchronous (1 master, many workers) and
asynchronous (independent workers synchronizing through a parameter server)
distributed training.
[distributed training](https://github.com/tensorflow/tensor2tensor/tree/master/docs/distributed_training.md).
* Easily swap amongst datasets and models by command-line flag with the data
generation script `t2t-datagen` and the training script `t2t-trainer`.

@@ -173,8 +187,10 @@ and many common sequence datasets are already available for generation and use.

**Problems** define training-time hyperparameters for the dataset and task,
mainly by setting input and output **modalities** (e.g. symbol, image, audio,
label) and vocabularies, if applicable. All problems are defined in
[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py).
label) and vocabularies, if applicable. All problems are defined either in
[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py)
or are registered with `@registry.register_problem` (run `t2t-datagen` to see
the list of all available problems).
**Modalities**, defined in
[`modality.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/modality.py),
abstract away the input and output data types so that **models** may deal with
@@ -211,7 +227,7 @@ inference. Users can easily switch between problems, models, and hyperparameter
sets by using the `--model`, `--problems`, and `--hparams_set` flags. Specific
hyperparameters can be overridden with the `--hparams` flag. `--schedule` and
related flags control local and distributed training/evaluation
([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/docs/distributed_training.md)).
([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/docs/distributed_training.md)).
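
For a concrete single-machine illustration, the following sketch combines these flags with a hyperparameter override; the values are the ones used in the walkthrough above and should be treated as placeholders for your own setup:

```
t2t-trainer \
  --data_dir=~/t2t_data \
  --problems=wmt_ende_tokens_32k \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --hparams='batch_size=1024' \
  --output_dir=~/t2t_train/base
```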

---

@@ -222,7 +238,7 @@ enables easily adding new ones and easily swapping amongst them by command-line
flag. You can add your own components without editing the T2T codebase by
specifying the `--t2t_usr_dir` flag in `t2t-trainer`.

You can currently do so for models, hyperparameter sets, and modalities. Please
You can do so for models, hyperparameter sets, modalities, and problems. Please
do submit a pull request if your component might be useful to others.

Here's an example with a new hyperparameter set:
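
A minimal sketch of what such a registration can look like is below; it assumes the registry helpers live in `tensor2tensor.utils.registry` and tweaks the base Transformer hyperparameters, and the module path and override value are illustrative:

```python
# In ~/usr/t2t_usr/my_registrations.py -- illustrative sketch only.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_my_very_own_hparams_set():
  # Start from the base Transformer hyperparameters and override a few values.
  hparams = transformer.transformer_base()
  hparams.hidden_size = 1024  # illustrative override
  return hparams
```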
@@ -242,7 +258,7 @@ def transformer_my_very_own_hparams_set():

```python
# In ~/usr/t2t_usr/__init__.py
import my_registrations
from . import my_registrations
```

```
@@ -253,9 +269,18 @@ You'll see under the registered HParams your
`transformer_my_very_own_hparams_set`, which you can directly use on the command
line with the `--hparams_set` flag.
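
Picking the new set up from the command line could then look like this (paths are placeholders):

```
t2t-trainer \
  --t2t_usr_dir=~/usr/t2t_usr \
  --model=transformer \
  --problems=wmt_ende_tokens_32k \
  --hparams_set=transformer_my_very_own_hparams_set \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/my_hparams
```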

`t2t-datagen` also supports the `--t2t_usr_dir` flag for `Problem`
registrations.
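
For example, generating data for a user-registered problem might look like the following, where the problem name and paths are hypothetical:

```
t2t-datagen \
  --t2t_usr_dir=~/usr/t2t_usr \
  --data_dir=~/t2t_data \
  --tmp_dir=/tmp/t2t_datagen \
  --problem=my_new_problem
```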

## Adding a dataset

See the [data generators
To add a new dataset, subclass
[`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
and register it with `@registry.register_problem`. See
[`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
for an example.

Also see the [data generators
README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/README.md).
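
A skeletal sketch of the registration pattern follows; the class body is left empty because the exact `Problem` methods to implement are documented in the linked `problem.py` and illustrated by `WMTEnDeTokens8k`:

```python
# Hypothetical sketch of registering a new dataset as a Problem.
from tensor2tensor.data_generators import problem
from tensor2tensor.utils import registry


@registry.register_problem
class MyNewDataset(problem.Problem):
  """Override the data-generation and hparams hooks defined by Problem."""
  # See problem.py and the WMT problems in wmt.py for the methods to implement.
```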

---
docs/distributed_training.md
@@ -10,52 +10,54 @@ along with a set of flags.

## `TF_CONFIG`

Both workers and parameter servers must have the `TF_CONFIG` environment
Both masters and parameter servers must have the `TF_CONFIG` environment
variable set.

The `TF_CONFIG` environment variable is a json-encoded string with the addresses
of the workers and parameter servers (in the `'cluster'` key) and the
of the masters and parameter servers (in the `'cluster'` key) and the
identification of the current task (in the `'task'` key).

For example:

```
cluster = {
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']
'master': ['host3:2222', 'host4:2222', 'host5:2222']
}
os.environ['TF_CONFIG'] = json.dumps({
'cluster': cluster,
'task': {'type': 'worker', 'index': 1}
'task': {'type': 'master', 'index': 1},
'environment': 'cloud',
})
```

## Command-line flags

The following T2T command-line flags must also be set on the workers for
The following T2T command-line flags must also be set on the masters for
distributed training:

- `--master=grpc://$ADDRESS`
- `--worker_replicas=$NUM_WORKERS`
- `--worker_gpu=$NUM_GPUS_PER_WORKER`
- `--worker_id=$WORKER_ID`
- `--worker_replicas=$NUM_MASTERS`
- `--worker_gpu=$NUM_GPUS_PER_MASTER`
- `--worker_id=$MASTER_ID`
- `--worker_job='/job:master'`
- `--ps_replicas=$NUM_PS`
- `--ps_gpu=$NUM_GPUS_PER_PS`
- `--schedule=train`
- `--sync`, if you want synchronous training, i.e. for there to be a single
master worker coordinating the work across "ps" jobs (yes, the naming is
unfortunate). If not set, then each worker operates independently while
variables are shared on the parameter servers.
master coordinating the work across "ps" jobs. If not set, then each master
operates independently while variables are shared on the parameter servers.

Parameter servers only need `--schedule=run_std_server`.
Parameter servers only need `--master=grpc://$ADDRESS` and
`--schedule=run_std_server`.
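
For instance, a parameter-server task could be launched like this, with its `TF_CONFIG` set as described above (the environment variable name and address are placeholders):

```
TF_CONFIG=$PS_TF_CONFIG t2t-trainer \
  --master=grpc://host1:2222 \
  --schedule=run_std_server
```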

## Utility to produce `TF_CONFIG` and flags

[`bin/make_tf_configs.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/make_tf_configs.py)
[`t2t-make-tf-configs`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t-make-tf-configs)
generates the `TF_CONFIG` json strings and the above-mentioned command-line
flags for the workers and parameter servers.
flags for the masters and parameter servers.

Given a set of worker and parameter server addresses, the script outputs, for
Given a set of master and parameter server addresses, the script outputs, for
each job, a line with the `TF_CONFIG` environment variable and the command-line
flags necessary for distributed training. For each job, you should invoke the
`t2t-trainer` with the `TF_CONFIG` value and flags that are output.
@@ -66,6 +68,9 @@ For example:
TF_CONFIG=$JOB_TF_CONFIG t2t-trainer $JOB_FLAGS --model=transformer ...
```

Modify the `--worker_gpu` and `--ps_gpu` flags, which specify how many GPUs are
on each master and ps, respectively, as needed for your machine/cluster setup.

## Command-line flags for eval jobs

Eval jobs should set the following flags and do not need the `TF_CONFIG`
23 changes: 23 additions & 0 deletions docs/index.md
@@ -0,0 +1,23 @@
# T2T: Tensor2Tensor Transformers

Check us out on
<a href=https://github.com/tensorflow/tensor2tensor>
GitHub
<img src="https://github.com/favicon.ico" width="16">
</a>
.

[![PyPI
version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
[![GitHub
Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
[![Contributions
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

See our
[README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/README.md)
for documentation.

More documentation and tutorials coming soon...
11 changes: 9 additions & 2 deletions setup.py
@@ -5,14 +5,19 @@

setup(
name='tensor2tensor',
version='1.0.12',
version='1.1.3',
description='Tensor2Tensor',
author='Google Inc.',
author_email='no-reply@google.com',
url='http://github.com/tensorflow/tensor2tensor',
license='Apache 2.0',
packages=find_packages(),
scripts=['tensor2tensor/bin/t2t-trainer', 'tensor2tensor/bin/t2t-datagen'],
package_data={'tensor2tensor.data_generators': ['test_data/*']},
scripts=[
'tensor2tensor/bin/t2t-trainer',
'tensor2tensor/bin/t2t-datagen',
'tensor2tensor/bin/t2t-make-tf-configs',
],
install_requires=[
'numpy',
'sympy',
@@ -22,6 +27,8 @@
'tensorflow': ['tensorflow>=1.2.0rc1'],
'tensorflow_gpu': ['tensorflow-gpu>=1.2.0rc1'],
},
tests_require=['nose'],
test_suite='nose.collector',
classifiers=[
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
3 changes: 2 additions & 1 deletion tensor2tensor/__init__.py
@@ -1,4 +1,5 @@
# Copyright 2017 Google Inc.
# coding=utf-8
# Copyright 2017 The Tensor2Tensor Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.