Hi,
Thank you for your awesome code (mcts.py).
However, I think the algorithm is not behaving according to the policy network. For example, in the link you gave, under section 2.1, it is stated that the goal is to find the policy that yields the highest reward.
When I ran the algorithm, the children selected were not the ones with the highest rewards. The child selected at the root also did not have the highest reward. Does this mean the policy network is not optimized, and therefore did not choose the best action/search?
E.g. as follows:
level 0
Num Children: 4
(0, Node; children: 4; visits: 354; reward: 295.106667)
(1, Node; children: 4; visits: 928; reward: 826.746667) <--- This was selected by the algorithm
(2, Node; children: 4; visits: 582; reward: 504.422222)
(3, Node; children: 4; visits: 933; reward: 831.537778) <--- Shouldn't this be selected?
Best Child: Value: 20; Moves: [20]
The input parameters I used were 10000 loops and 8 levels.
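I may be misreading how mcts.py scores children, but as I understand standard UCT selection, it compares each child's average reward (total reward divided by visits) plus an exploration bonus, not the raw total reward. Here is a rough sketch of that scoring applied to the numbers above (the `uct` function and the exploration constant `c = 1.0` are my assumptions, not taken from your code):

```python
import math

# (visits, total_reward) for the four root children, copied from the output above
children = [(354, 295.106667), (928, 826.746667),
            (582, 504.422222), (933, 831.537778)]

N = sum(v for v, _ in children)  # parent (root) visit count

def uct(visits, reward, parent_visits, c=1.0):
    # exploitation term (average reward) + exploration bonus;
    # this is the generic UCT formula, which may differ from mcts.py's
    return reward / visits + c * math.sqrt(2 * math.log(parent_visits) / visits)

for i, (v, r) in enumerate(children):
    print(f"child {i}: avg reward = {r / v:.4f}, uct = {uct(v, r, N):.4f}")
```

If I run this, children 1 and 3 come out with nearly identical average rewards (about 0.891), so depending on the exploration constant and tie-breaking, either one could plausibly be picked. Is that what is happening here, or does your code select the best child by a different criterion?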
Thank you once again for your help. Hope to hear from you soon.
James