VOI-aware MCTS
ECAI 2012

David Tolpin, Solomon Eyal Shimony
Ben-Gurion University of the Negev,
Beer Sheva, Israel

Multi-armed Bandits

Regret minimization: UCB sampling minimizes the cumulative regret, the total loss incurred over all trials. For choosing a single best move, however, the natural criterion is the simple regret: the expected loss of the arm finally selected.
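The two notions of regret can be written as follows (a sketch with assumed notation, not taken from the slide: $\mu^*$ is the best arm's mean, $i_t$ the arm pulled at step $t$, $j_n$ the arm chosen after sampling ends):

```latex
% Cumulative regret: total loss over all n pulls (the quantity UCB minimizes).
R_n = n\mu^* - \sum_{t=1}^{n} \mu_{i_t}
% Simple regret: loss of the single arm chosen when sampling stops,
% the natural objective for move selection at the root of MCTS.
r_n = \mu^* - \mu_{j_n}
```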

Multi-armed Bandits

[Figure: empirical results on a bandit with 25 arms, 10000 trials]

Computer Go

Tuning the Sample Cost

Best results for sample cost $\approx 10^{-6}$: a winning rate of 64% at 10000 samples per ply.

Winning rate vs. number of samples

With the sample cost fixed at $10^{-6}$, the best results are obtained for intermediate values of $N_{samples}$.

[Figure: winning rate vs. number of samples per ply]

Monte-Carlo Sampling in Trees

Main Results

Hybrid sampling scheme

  1. At the root node: sample based on the VOI estimate.
  2. At non-root nodes: sample using UCT.
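The two-phase rule above can be sketched as follows (a minimal sketch, not the authors' implementation; the `voi_estimate` interface and all names are assumptions):

```python
import math

def ucb1_select(means, counts, total, c=math.sqrt(2)):
    """Standard UCB1/UCT choice, used at every non-root node."""
    def score(i):
        if counts[i] == 0:
            return float("inf")          # visit unsampled children first
        return means[i] + c * math.sqrt(math.log(total) / counts[i])
    return max(range(len(means)), key=score)

def select_child(is_root, means, counts, total, voi_estimate):
    if is_root:
        # Step 1: at the root, pick the child with the largest VOI estimate.
        return max(range(len(means)), key=voi_estimate)
    # Step 2: below the root, fall back to plain UCT.
    return ucb1_select(means, counts, total)
```

`voi_estimate` would be instantiated with the VOI upper bounds given later in the slides; UCT remains unchanged everywhere else in the tree.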

Upper Bounds on VOI

Upper bounds on the intrinsic VOI $\Lambda^b_i$ of testing the $i$th arm $N$ times, where $\alpha$ denotes the arm with the highest sample mean, $\beta$ the second-best arm, and $\overline X_i^{n_i}$ the mean of the $n_i$ samples of arm $i$:

$$\Lambda_\alpha^b < \frac{N\overline X_\beta^{n_\beta}}{n_\alpha+1}\cdot 2\exp\left(-1.37\left(\overline X_\alpha^{n_\alpha} - \overline X_\beta^{n_\beta}\right)^2 n_\alpha\right)$$

$$\Lambda_{i|i\ne\alpha}^b < \frac{N\left(1-\overline X_\alpha^{n_\alpha}\right)}{n_i+1}\cdot 2\exp\left(-1.37\left(\overline X_\alpha^{n_\alpha} - \overline X_i^{n_i}\right)^2 n_i\right)$$
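A direct transcription of the two bounds (a sketch assuming rewards in $[0,1]$, $N$ the remaining sample budget, and the function names chosen here for illustration):

```python
import math

def voi_bound_alpha(mean_alpha, mean_beta, n_alpha, N):
    """Upper bound on the VOI of sampling the current best arm alpha."""
    gap = mean_alpha - mean_beta
    return (N * mean_beta / (n_alpha + 1)) * 2.0 * math.exp(-1.37 * gap * gap * n_alpha)

def voi_bound_other(mean_alpha, mean_i, n_i, N):
    """Upper bound on the VOI of sampling a non-best arm i."""
    gap = mean_alpha - mean_i
    return (N * (1.0 - mean_alpha) / (n_i + 1)) * 2.0 * math.exp(-1.37 * gap * gap * n_i)
```

Both bounds decay exponentially in $n\,\Delta^2$, so well-sampled, well-separated arms receive little further attention.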

Sample Redistribution

MCTS re-uses rollouts generated at earlier search states.

  1. Estimate VOI as though the information is discarded.
  2. Stop early if the VOI is below a certain threshold.
  3. Save the unused sample budget for search in future states.

The cost of a sample is the VOI of increasing a future budget by one sample.
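The redistribution loop might look like this sketch (all names are assumptions; `voi_estimate` returns a per-arm VOI as though earlier rollouts were discarded, and `cost` plays the role of the fixed sample cost, e.g. the $10^{-6}$ used in the experiments):

```python
def search_with_early_stop(pull, n_arms, budget, cost, voi_estimate):
    counts = [1] * n_arms
    sums = [float(pull(i)) for i in range(n_arms)]  # one initial rollout per arm
    spent = n_arms
    while spent < budget:
        means = [s / c for s, c in zip(sums, counts)]
        remaining = budget - spent
        vois = [voi_estimate(means, counts, i, remaining) for i in range(n_arms)]
        best = max(range(n_arms), key=lambda i: vois[i])
        if vois[best] < cost:        # VOI below the sample cost:
            break                    # stop early, keep the rest of the budget
        sums[best] += pull(best)
        counts[best] += 1
        spent += 1
    means = [s / c for s, c in zip(sums, counts)]
    chosen = max(range(n_arms), key=lambda i: means[i])
    return chosen, budget - spent    # leftover budget for future search states
```

The leftover budget is what gets carried over to search in future states, which is how the fixed cost per sample pays off.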


Future Work