Commit

Merge branch 'release/1.3.0'
ronaldosvieira committed Sep 19, 2022
2 parents 8fe6d14 + a60e6b4 commit cf02be3
Showing 15 changed files with 555 additions and 438 deletions.
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -19,5 +19,5 @@ license: MIT
message: "If you use this software, please cite it as below."
repository-code: "https://github.com/ronaldosvieira/gym-locm"
title: "OpenAI Gym Environments for Legends of Code and Magic"
version: "1.2.0"
version: "1.3.0"
...
28 changes: 21 additions & 7 deletions README.md
@@ -243,8 +243,9 @@ engine, and with a specific random seed:
### Train draft agents with deep reinforcement learning

We provide scripts to train deep reinforcement learning draft agents as described in our
thesis <a href="#vieira2020a">[2]</a> and paper <a href="#vieira2020b">[3]</a>. Further instructions are available in the README.md in
the [experiments](https://github.com/ronaldosvieira/gym-locm/tree/master/gym_locm/experiments)
thesis <a href="#vieira2020a">[2]</a> and SBGames 2020 paper <a href="#vieira2020b">[3]</a>.
Further instructions are available in the README.md file in
the [experiments](gym_locm/experiments)
package.

To install the dependencies necessary to run the scripts, install
@@ -253,16 +254,25 @@ the repository with
pip install -e .['experiments']
```

### Use trained draft agents

We provide a collection of draft agents trained with deep
We also provide a collection of draft agents trained with deep
reinforcement learning, and a script to use them in LOCM's original engine.
Further details on these agents and instructions for the script are available in the
README.md in the
[trained_models](https://github.com/ronaldosvieira/gym-locm/tree/master/gym_locm/trained_models)
[trained_models](gym_locm/trained_models)
package. The use of these draft agents with the Runner script is not implemented yet.

### Train battle agents with deep reinforcement learning

We provide scripts to train deep reinforcement learning battle agents as described in our
SBGames 2022 paper <a href="#vieira2022a">[4]</a>. Further instructions are available
in the README.md file in the [experiments/papers/sbgames-2022](gym_locm/experiments/papers/sbgames-2022)
package.

The use of these draft agents with the Runner script is not implemented yet.
To install the dependencies necessary to run the scripts, install
the repository with
```
pip install -e .['experiments']
```

## References
1. <span id="kowalski2020">Kowalski, J., Miernik, R. (2020). Evolutionary
@@ -276,5 +286,9 @@ of Minas Gerais, Belo Horizonte, Brazil.</span>
Collectible Card Games via Reinforcement Learning. 19th Brazilian Symposium of Computer Games
and Digital Entertainment (SBGames).</span>

4. <span id="vieira2022a">Vieira, R., Tavares, A. R., Chaimowicz, L. (2022). Exploring Deep
Reinforcement Learning for Battling in Collectible Card Games. 21st Brazilian Symposium
of Computer Games and Digital Entertainment (SBGames).</span>

## License
[MIT](https://choosealicense.com/licenses/mit/)
2 changes: 2 additions & 0 deletions gym_locm/agents.py
@@ -1198,8 +1198,10 @@ def act(self, state):
"pass": PassBattleAgent,
"random": RandomBattleAgent,
"greedy": GreedyBattleAgent,
"osl": GreedyBattleAgent,
"rule-based": RuleBasedBattleAgent,
"max-attack": MaxAttackBattleAgent,
"ma": MaxAttackBattleAgent,
"coac": CoacBattleAgent,
"mcts": MCTSBattleAgent
}
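
The two added keys are aliases: `osl` resolves to the same class as `greedy`, and `ma` to the same class as `max-attack`, matching the MA/OSL naming used in the experiment READMEs below. As a hedged illustration of how such a name-to-class mapping can be used (the dict and helper names here are hypothetical, since the enclosing dict's name is not visible in this hunk):

```python
# Illustrative only: mirrors the aliases added above; names are hypothetical.
from gym_locm.agents import GreedyBattleAgent, MaxAttackBattleAgent

_battle_agents = {
    "greedy": GreedyBattleAgent,
    "osl": GreedyBattleAgent,         # alias added by this commit
    "max-attack": MaxAttackBattleAgent,
    "ma": MaxAttackBattleAgent,       # alias added by this commit
}

def parse_battle_agent(name: str):
    """Resolve a battle agent name (or alias) into an agent instance."""
    return _battle_agents[name]()

agent = parse_battle_agent("osl")     # same behavior as requesting "greedy"
```
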
37 changes: 32 additions & 5 deletions gym_locm/envs/battle.py
@@ -200,13 +200,15 @@ def get_episode_rewards(self):

class LOCMBattleSingleEnv(LOCMBattleEnv):
def __init__(self, battle_agent=RandomBattleAgent(),
play_first=True, **kwargs):
play_first=True, alternate_roles=False, **kwargs):
# init the env
super().__init__(**kwargs)

# also init the battle agent and the new parameter
# also init the battle agent and the new parameters
self.battle_agent = battle_agent
self.play_first = play_first
self.alternate_roles = alternate_roles
self.rewards_single_player = []

# reset the battle agent
self.battle_agent.reset()
@@ -216,6 +218,9 @@ def reset(self) -> np.array:
Resets the environment.
The game is put into its initial state and all agents are reset.
"""
if self.alternate_roles:
self.play_first = not self.play_first

# reset what is needed
encoded_state = super().reset()

@@ -227,6 +232,8 @@ def reset(self) -> np.array:
while self.state.current_player.id != PlayerOrder.SECOND:
super().step(self.battle_agent.act(self.state))

self.rewards_single_player.append(0.0)

return encoded_state

def step(self, action):
@@ -253,17 +260,27 @@ def step(self, action):
if not self.play_first:
reward = -reward

try:
self.rewards_single_player[-1] += reward
except IndexError:
self.rewards_single_player = [reward]

return state, reward, done, info

def get_episode_rewards(self):
return self.rewards_single_player


class LOCMBattleSelfPlayEnv(LOCMBattleEnv):
def __init__(self, play_first=True, adversary_policy=None, **kwargs):
def __init__(self, play_first=True, alternate_roles=True, adversary_policy=None, **kwargs):
# init the env
super().__init__(**kwargs)

# also init the new parameters
self.play_first = play_first
self.adversary_policy = adversary_policy
self.alternate_roles = alternate_roles
self.rewards_single_player = []

def reset(self) -> np.array:
"""
@@ -273,8 +290,8 @@ def reset(self) -> np.array:
# reset what is needed
encoded_state = super().reset()

# also reset the battle agent
self.play_first = not self.play_first
if self.alternate_roles:
self.play_first = not self.play_first

# if playing second, have first player play
if not self.play_first:
Expand All @@ -288,6 +305,8 @@ def reset(self) -> np.array:
state, reward, done, info = super().step(0)
break

self.rewards_single_player.append(0.0)

return encoded_state

def step(self, action):
@@ -315,4 +334,12 @@ def step(self, action):
if not self.play_first:
reward = -reward

try:
self.rewards_single_player[-1] += reward
except IndexError:
self.rewards_single_player = [reward]

return state, reward, done, info

def get_episode_rewards(self):
return self.rewards_single_player
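
To make the new `alternate_roles` flag and the per-episode reward bookkeeping concrete, a minimal usage sketch follows. It assumes a standard Gym-style interaction loop and that sampled actions are acceptable to the environment; it is not code from this commit.

```python
# Hedged sketch exercising alternate_roles and get_episode_rewards() as added above.
from gym_locm.envs.battle import LOCMBattleSingleEnv
from gym_locm.agents import RandomBattleAgent

env = LOCMBattleSingleEnv(battle_agent=RandomBattleAgent(),
                          play_first=True, alternate_roles=True)

for episode in range(4):                      # roles swap on every reset()
    obs, done = env.reset(), False
    while not done:
        action = env.action_space.sample()    # stand-in for a trained policy
        obs, reward, done, info = env.step(action)

print(env.get_episode_rewards())              # one cumulative reward per episode
```
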
2 changes: 0 additions & 2 deletions gym_locm/envs/draft.py
@@ -152,8 +152,6 @@ def step(self, action: Union[int, Action]) -> (np.array, int, bool, dict):

done = True

del info['turn']

if reward_before is None:
raw_rewards = (0.0,) * len(self.reward_functions)
else:
2 changes: 0 additions & 2 deletions gym_locm/envs/full_game.py
@@ -107,8 +107,6 @@ def step(self, action):
if winner is not None:
reward = 1 if winner == PlayerOrder.FIRST else -1

del info['turn']

return self.encode_state(), reward, done, info

def _encode_state_battle(self):
4 changes: 2 additions & 2 deletions gym_locm/experiments/README.md
@@ -17,7 +17,7 @@ python3 gym_locm/experiments/hyp-search.py --approach <approach> --battle-agent
--path hyp_search_results/ --seed 96765 --processes 4
```

The list and range of hyperparameted explored is available in the Appendix of our paper and in Attachment A of
The list and range of hyperparameters explored is available in the Appendix of our paper and in Attachment A of
our thesis. We performed hyperparameter tunings for all combinations of `<approach>` (`immediate`, `history`
and `lstm`) and `<battle_agent>` (`max-attack` and `greedy`). Each run of the script took around 2 days with the
`max-attack` battle agent and more than a week with the `greedy` battle agent. To learn about other script's
@@ -37,7 +37,7 @@ python3 gym_locm/experiments/training.py --approach <approach> --battle-agent <b
```

We trained 20 draft agents (ten 1st players and ten 2nd players) of each combination of `<approach>` and
`<battle agent>`, using the best sets of hyperparameters found for them in the previous experiment. That comprises
`<battle_agent>`, using the best sets of hyperparameters found for them in the previous experiment. That comprises
ten runs of the script, in which we used the seeds 32359627, 91615349, 88803987, 83140551, 50731732, 19279988, 35717793,
48046766, 86798618 and 62644993.

109 changes: 109 additions & 0 deletions gym_locm/experiments/papers/entcom-2022/README.md
@@ -0,0 +1,109 @@
# Reproducing the experiments from our Entertainment Computing 2022 paper

This readme file contains the information necessary to reproduce the experiments
from our paper in Entertainment Computing 2022 named "_Exploring Deep Reinforcement Learning for
Drafting in Collectible Card Games_." Please contact
me at [ronaldo.vieira@dcc.ufmg.br](mailto:ronaldo.vieira@dcc.ufmg.br) in case any
of the instructions below do not work.

The game engine for LOCM 1.2 can be found at [engine.py](../../../engine.py), which is used by the OpenAI
Gym environments (more info on the repository's main page). The implementation of our
approaches can be found in the experiment files mentioned below. The resulting agents can be found in the
[trained_models](../../../trained_models) folder, along with instructions on how to use them.

## Section 4.1: hyperparameter search

To perform a hyperparameter tuning, simply execute the [hyp-search.py](../../../experiments/hyp-search.py) script:

```
python3 gym_locm/experiments/hyp-search.py --approach <approach> --battle-agent <battle_agent> \
--path hyp_search_results/ --seed 96765 --processes 4
```

The list and range of hyperparameters explored are available in Appendix A of our paper. We performed
hyperparameter tunings for all combinations of `<approach>` (`immediate`, `history` and `lstm`) and
`<battle_agent>` (`ma` and `osl`). To learn about the script's other parameters, execute it with the
`--help` flag.

## Section 4.2: comparison between our approaches

To train **two** draft agents (a 1st player and a 2nd player) with a specific draft approach and battle agent,
in asymmetric self-play, simply execute the [training.py](../../../experiments/training.py) script:

```
python3 gym_locm/experiments/training.py --approach <approach> --battle-agent <battle_agent> \
--path training_results/ --switch-freq <switch_freq> --layers <layers> --neurons <neurons> \
--act-fun <activation_function> --n-steps <batch_size> --nminibatches <n_minibatches> \
--noptepochs <n_epochs> --cliprange <cliprange> --vf-coef <vf_coef> --ent-coef <ent_coef> \
--learning-rate <learning_rate> --seed 32359627 --concurrency 4
```

We trained ten draft agents (five 1st players and five 2nd players) of each combination of `<approach>` and
`<battle_agent>`, using the best sets of hyperparameters found for them in the previous experiment. That comprises
five runs of the script, in which we used the seeds `32359627`, `91615349`, `88803987`, `83140551` and `50731732`.

To learn about the script's other parameters, execute it with the `--help` flag. Running the script with all default
parameters will train an `immediate` drafter with the `ma` battler, using the best set of hyperparameters
we found for that combination. The best sets of hyperparameters for the other combinations are available in
Appendix A of our paper.

## Section 4.3: comparison with other draft agents

To run one of the tournaments, simply execute the [tournament.py](../../../experiments/tournament.py) script:
```
python3 gym_locm/experiments/tournament.py \
--drafters random max-attack coac closet-ai icebox chad \
gym_locm/trained_models/<battle_agent>/immediate-1M/ \
gym_locm/trained_models/<battle_agent>/lstm-1M/ \
gym_locm/trained_models/<battle_agent>/history-1M/ \
--battler <battle_agent> --concurrency 4 --games 1000 --path tournament_results/ \
--seeds 32359627 91615349 88803987 83140551 50731732
```
replacing `<battle_agent>` with either `ma` or `osl` to run the corresponding tournament as
depicted in our paper. The script will create files at `tournament_results/` describing
the individual win rates of every set of matches, the aggregate win rates, average mana curves (section 4.3.2)
and every individual draft choice made by every agent, in CSV format, for human inspection, and as serialized
Pandas data frames (PKL format), for easy further data manipulation. To learn about the other script's
parameters, execute it with the `--help` flag.

To reproduce the table of agent similarities and the plot containing the agents' three-dimensional coordinates
found via Principal Component Analysis and grouped via K-Means (section 4.3.3), simply execute the
[similarities.py](../../../experiments/similarities.py) script:
```
python3 gym_locm/experiments/similarities.py \
--files ma_tournament_results/choices.csv osl_tournament_results/choices.csv
```
which will write the similarities table (in CSV and PKL formats) and the plot (in PNG format)
to the current folder.
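
For intuition, the sketch below shows the kind of analysis involved (pandas and scikit-learn assumed; the column layout and cluster count are guesses, and this is not the actual `similarities.py` implementation):

```python
# Illustrative only: project per-agent draft-choice vectors to 3-D and group them.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Assumed layout: one row per agent, one numeric column per draft-choice feature.
choices = pd.read_csv("ma_tournament_results/choices.csv", index_col=0)

coords = PCA(n_components=3).fit_transform(choices.values)      # 3-D coordinates
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(coords)  # group the agents

summary = pd.DataFrame(coords, index=choices.index,
                       columns=["pc1", "pc2", "pc3"]).assign(cluster=clusters)
print(summary)
```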

## Section 4.4: agent improvement in the SCGAI competition

We used the source code of the Strategy Card Game AI competition
([2019](https://github.com/acatai/Strategy-Card-Game-AI-Competition/tree/master/contest-2019-08-COG) and
[2020](https://github.com/acatai/Strategy-Card-Game-AI-Competition/tree/master/contest-2020-08-COG) editions)
to re-run the matches, replacing the *max-attack* player (named Baseline2) with a personalized player featuring
our best draft agent and the battle portion of the *max-attack* player. This can be reproduced by altering line
11 (2019) or line 2 (2020) of the runner script (`run.sh`) from `AGENTS[10]="python3 Baseline2/main.py"` to
```bash
AGENTS[10]="python3 gym_locm/toolbox/predictor.py --battle \"python3 Baseline2/main.py\" \
--draft-1 path/to/gym_locm/trained_models/max-attack/immediate-1M/1st/6.json \
--draft-2 path/to/gym_locm/trained_models/max-attack/immediate-1M/2nd/8.json"
```
and then executing it. Parallelism can be achieved by running the script in multiple processes or machines. Save the
output to text files named `out-*.txt` (with a number instead of `*`) in the same folder, then run `analyze.py`
to extract win rates. The runner script can take up to several days, and the analyze script can take up to a few hours.
See the [trained_models](../../../trained_models) package for more information on the predictor script.

## Section 4.5: importance of being history-aware in LOCM

This experiment is simply a re-execution of the OSL tournament from section 4.3, adding a new draft agent to the
tournament (`historyless`). To reproduce it, execute the following script:
```
python3 gym_locm/experiments/tournament.py \
--drafters random max-attack coac closet-ai icebox chad historyless \
gym_locm/trained_models/<battle_agent>/immediate-1M/ \
gym_locm/trained_models/<battle_agent>/lstm-1M/ \
gym_locm/trained_models/<battle_agent>/history-1M/ \
--battler osl --concurrency 4 --games 1000 --path osl_historyless_tournament_results/ \
--seeds 32359627 91615349 88803987 83140551 50731732
```
74 changes: 74 additions & 0 deletions gym_locm/experiments/papers/sbgames-2022/README.md
@@ -0,0 +1,74 @@
# Reproducing the experiments from our SBGames 2022 paper

This readme file contains the information necessary to reproduce the experiments
from our paper in SBGames 2022 named "_Exploring Deep Reinforcement Learning for
Battling in Collectible Card Games_." Although we mention in the paper that we use
gym-locm's version 1.3.0, any future version should also suffice. Please contact
me at [ronaldo.vieira@dcc.ufmg.br](mailto:ronaldo.vieira@dcc.ufmg.br) in case any
of the instructions below do not work.

## Experiment 1: hyperparameter search

We use Weights and Biases (W&B) to orchestrate our hyperparameter search. The
`hyp-search.yaml` file contains the search configuration, including hyperparameter
ranges. Having W&B installed, executing the following command on a terminal will
create a "sweep" on W&B:

```commandline
wandb sweep gym_locm/experiments/sbgames-2022/hyp-search.yaml
```

This command will output a _sweep ID_, including the entity and project names.
Save it for the next step.
From this moment on, the hyperparameter search can be observed on W&B's website.
However, no training sessions will happen until you "recruit" one or more
computers to run the training sessions. That can be done by executing the following
command on a terminal:

```commandline
wandb agent <sweep_id>
```

Where the `sweep_id` parameter should be the _sweep ID_ saved from the output of
the previous command. From now on, the recruited computers will run training sessions
continuously until you tell them to stop. That can be done on W&B's website or by
issuing a CTRL + C on the terminal where the training sessions are being executed.
In our paper, we executed 35 training sessions. All the statistics can be seen on
W&B's website, including which sets of hyperparameters yielded the best results.
For more info on W&B sweeps, see [the docs](https://docs.wandb.ai/guides/sweeps).

## Experiment 2: training in self-play

Using the best set of hyperparameters found in the previous experiment, we executed
five training sessions, each with a different random seed. To reproduce the training
sessions we used for the paper, execute the following command on a terminal:

```commandline
python gym_locm/experiments/training.py --act-fun=relu --adversary=self-play \
--cliprange=0.2 --concurrency=4 --draft-agent=random --ent-coef=0.005 \
--eval-episodes=500 --gamma=0.99 --layers=7 --learning-rate=0.0041142387646692325 \
--n-steps=512 --neurons=455 --nminibatches-divider=1 --noptepochs=1 --num-evals=100 \
--path=gym_locm/experiments/papers/sbgames-2022/self-play --role=alternate \
--seed=<seed> --switch-freq=10 --task=battle --train-episodes=100000 --vf-coef=1
```

Repeat the command five times, each time with a different `seed` parameter. The seeds we used were:
`91577453`, `688183`, `63008694`, `4662087`, and `58793266`.

## Experiment 3: training against a fixed battle agent

This experiment uses almost the same command as the previous one:

```commandline
python gym_locm/experiments/training.py --act-fun=relu --adversary=fixed \
--battle-agent=<battle_agent> --cliprange=0.2 --concurrency=4 --draft-agent=random \
--ent-coef=0.005 --eval-episodes=500 --gamma=0.99 --layers=7 \
--learning-rate=0.0041142387646692325 --n-steps=512 --neurons=455 \
--nminibatches-divider=1 --noptepochs=1 --num-evals=100 \
--path=gym_locm/experiments/papers/sbgames-2022/fixed --role=alternate --seed=<seed> \
--switch-freq=10 --task=battle --train-episodes=100000 --vf-coef=1
```

Repeat the command ten times, each time with a different combination of `battle_agent` and `seed`
parameters. The seeds we used were: `91577453`, `688183`, `63008694`, `4662087`,
and `58793266`. The battle agents we used were `max-attack` (MA) and `greedy` (OSL).