Output-Sensitive Adaptive Metropolis-Hastings
for Probabilistic Programs

David Tolpin, Jan Willem van de Meent,
Brooks Paige, Frank Wood

University of Oxford
Oxford, UK


Preliminaries

Probabilistic Program

(let [;; Model
      dist (sample (categorical [[normal 1/2] [gamma 1/2]]))
      a (sample (gamma 1 1))
      b (sample (gamma 1 1))
      d (dist a b)]
  ;; Observations
  (observe d 1) (observe d 2)
  (observe d 4) (observe d 7)
  ;; Explanation
  (predict :d (type d))
  (predict :a a) (predict :b b))

Inference Objectives

Lightweight Metropolis-Hastings (LMH)

Run $\mathcal{P}$ once, remember $\pmb{x}, \pmb{z}$.
loop
    Uniformly select $x_i$.
    Propose a value for $x_i$.
    Run $\mathcal{P}$, remember $\pmb{x}', \pmb{z}'$.
    Accept ($\pmb{x}, \pmb{z} \gets \pmb{x}', \pmb{z}'$) or reject with MH probability.
    Output $\pmb{z}$.
end loop
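A minimal Clojure sketch of one LMH transition. It assumes a hypothetical run-program helper that re-executes $\mathcal{P}$ with the given choices reused where control flow allows and returns {:x trace, :z output, :log-p log-joint}; the simplified acceptance ratio is an assumption for illustration, not Anglican's exact implementation.

;; One single-site LMH transition. `run-program` is a hypothetical
;; helper: given a map of reused random choices it re-executes P and
;; returns {:x trace, :z output, :log-p log-joint}.
(defn lmh-step [run-program {:keys [x log-p] :as state}]
  (let [;; Uniformly select a random choice x_i.
        i     (rand-nth (vec (keys x)))
        ;; Propose a new value for x_i by dropping it and re-running P;
        ;; the remaining choices are reused where possible.
        trial (run-program (dissoc x i))
        ;; Simplified MH log-acceptance ratio; the |x|/|x'| factor
        ;; accounts for uniform selection over traces of changing size.
        log-a (+ (:log-p trial)
                 (- log-p)
                 (Math/log (/ (count x) (count (:x trial)))))]
    (if (< (Math/log (rand)) log-a)
      trial      ;; accept: x,z <- x',z'
      state)))   ;; reject: keep the current trace and output

Each iteration emits the current :z as a sample.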

Can we do better?

References

  1. Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
  2. Brooks Paige, Frank Wood, Arnaud Doucet, and Yee Whye Teh. Asynchronous anytime sequential Monte Carlo. In NIPS-2014, to appear.
  3. David Wingate, Andreas Stuhlmüller, and Noah D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In AISTATS-2011.
  4. Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In AISTATS-2014.

Experiments

Convergence: Gaussian Process

$\begin{align*} f \sim{}& \mathcal{GP}(m, k),\\ \text{where } m(x) ={}& ax^2 + bx + c, \quad k(x, x') = d\, e^{-\frac{(x - x')^2}{2g}}. \end{align*}$
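In Anglican the hyperparameter priors and the mean and covariance functions might be written as below; this is a sketch with assumed priors, and observing data under the GP is elided, so it is not the exact experiment code.

;; Sketch of the GP hyperparameter model; the priors shown here are
;; assumptions, and the observe over function values is elided.
(let [a (sample (normal 0 1))          ;; mean: m(x) = a x^2 + b x + c
      b (sample (normal 0 1))
      c (sample (normal 0 1))
      d (sample (gamma 1 1))           ;; kernel output scale
      g (sample (gamma 1 1))           ;; kernel length scale
      m (fn [x] (+ (* a x x) (* b x) c))
      k (fn [x x'] (* d (Math/exp (- (/ (Math/pow (- x x') 2)
                                        (* 2 g))))))]
  ;; (observe ... ) over the observed function values is elided.
  (predict :a a) (predict :b b) (predict :c c)
  (predict :d d) (predict :g g))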


[Figure: convergence of the Gaussian process model at 1,000, 10,000, and 100,000 samples.]

Sample Size: Kalman Smoother

$\begin{align*} \vec{x}_t &\sim \mathrm{Norm}(\vec{A} \cdot \vec{x}_{t-1}, \vec{Q}), & \vec{y}_t &\sim \mathrm{Norm}(\vec{C} \cdot \vec{x}_t, \vec{R}). \end{align*}$

$\begin{align*} \vec{A} &= \begin{bmatrix} \cos \omega & -\sin \omega \\ \sin \omega & \cos \omega \end{bmatrix}, & \vec{Q} &= \begin{bmatrix} q & 0 \\ 0 & q \end{bmatrix}. \end{align*}$

100 16-dimensional observations,
500 samples after 10,000 samples of burn-in.
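An Anglican-style sketch of this state-space model, assuming Anglican's mvn multivariate normal, core.matrix's mmul, and externally defined C, R, x0, and ys; these names and the priors on omega and q are assumptions standing in for the experiment setup.

;; Sketch of the Kalman model: a 2-D rotating state observed through
;; a linear map C with noise covariance R. mvn is a multivariate
;; normal; mmul is matrix multiplication.
(let [omega (sample (uniform-continuous 0 (* 2 Math/PI)))
      q     (sample (gamma 1 1))
      A     [[(Math/cos omega) (- (Math/sin omega))]
             [(Math/sin omega) (Math/cos omega)]]
      Q     [[q 0.0] [0.0 q]]]
  (loop [x x0, ys ys]
    (when (seq ys)
      (let [x' (sample (mvn (mmul A x) Q))]      ;; transition
        (observe (mvn (mmul C x') R) (first ys)) ;; observation
        (recur x' (rest ys)))))
  (predict :omega omega)
  (predict :q q))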

Acknowledgments

This material is based on research sponsored by DARPA through the U.S. Air Force Research Laboratory under Cooperative Agreement number FA8750-14-2-0004.

Metropolis-Hastings with Adaptive Scheduling

Initialize $\pmb{W}^0$ to a constant. Run $\mathcal{P}$ once.
for $t = 1 \ldots \infty$
    Select $x^t_i$ with probability $\alpha^t_i = W_i^t / \sum_{j=1}^{|\pmb{x}^t|} W_j^t$.
    Propose a value for $x^t_i$.
    Run $\mathcal{P}$, accept or reject with MH probability.
    if accepted
        Compute $\pmb{W}^{t+1}$ based on the program output.
    else
        $\pmb{W}^{t+1} \gets \pmb{W}^{t}$
    end if
end for
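The selection step can be sketched in Clojure as sampling an index in proportion to its weight (hypothetical names; assumes a non-empty weight map):

;; Select a choice index i with probability W_i / sum_j W_j.
;; ws maps choice indices to weights; names are assumptions.
(defn select-choice [ws]
  (let [total (reduce + (vals ws))
        u     (* (rand) total)]
    ;; Walk the cumulative weights until u is exceeded.
    (loop [[[i w] & more] (seq ws), acc 0.0]
      (let [acc (+ acc w)]
        (if (or (>= acc u) (empty? more))
          i
          (recur more acc))))))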

Quantifying the Influence

$W_i$ reflects the anticipated change in $\pmb{z}$ from modifying $x_i$.
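A weight consistent with the reward and count updates under "Back-propagating rewards" below (the exact form is an assumption here) is the mean reward per proposal,

$W_i = \frac{r_i}{c_i},$

so variables whose modifications tend to change the output are selected more often.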

Delayed Changes

Modifying x2 affects the output ...

(let [x1 (sample (normal 1 10))
      x2 (sample (normal x1 1))]
  (observe (normal x2 1) 2)
  (predict :x1 x1))

... but only when x1 is also modified.

Back-propagating rewards

When a modification of $x_k$ is accepted:

Append $x_k$ to the history.
if $\pmb{z}^{t+1} \ne \pmb{z}^{t}$
    $w \gets \frac{1}{|\text{history}|}$
    for $x_m$ in history
        $r_m \gets r_m + w, \quad c_m \gets c_m + w$
    end for
    Flush the history.
else
    $c_k \gets c_k + 1$
end if
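The same update as a Clojure sketch over a state map holding the history and the per-choice rewards and counts (field names are assumptions):

;; Reward update after an accepted modification of x_k. state holds
;; :history (choices since the last output change) and per-choice
;; reward (:r) and count (:c) maps; z and z' are the outputs before
;; and after the accepted move.
(defn update-rewards [state k z z']
  (let [state (update state :history conj k)
        hist  (:history state)]
    (if (not= z' z)
      ;; Output changed: share a unit reward among all choices in the
      ;; history, then flush it.
      (let [w (/ 1.0 (count hist))]
        (-> (reduce (fn [s m]
                      (-> s
                          (update-in [:r m] (fnil + 0.0) w)
                          (update-in [:c m] (fnil + 0.0) w)))
                    state hist)
            (assoc :history [])))
      ;; Output unchanged: only the proposal count of x_k grows.
      (update-in state [:c k] (fnil + 0.0) 1.0))))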

Convergence:

For any partitioning of $\pmb{x}$, Adaptive LMH selects variables from each partition with non-zero probability.

Contributions

Future Work

http://www.offtopia.net/nips-pp-ws-2014-ardb-poster/