Output-Sensitive Adaptive Metropolis Hastings for Probabilistic Programs

Preliminaries

Probabilistic Program

• A program with random computations.
• Distributions are conditioned by `observations'.
• Values of certain expressions are `predicted' — the output.

(let [;; Model
        dist (sample (categorical [[normal 1/2] [gamma 1/2]]))
        a (sample (gamma 1 1))
        b (sample (gamma 1 1))
        d (dist a b)]
    ;; Observations
    (observe d 1) (observe d 2)
    (observe d 4) (observe d 7)
    ;; Explanation
    (predict :d (type d))
    (predict :a a) (predict :b b)))

Inference Objectives

• Suggest most probable explanation (MPE) - most likely assignment
• Approximately compute integral of the
• Continuously and infinitely generate a sequence of samples. ✓

Lighweight Metropolis-Hastings (LMH)

$\mathcal{P}$ — probabilistic program.
$\pmb{x}$ — random variables.
$\pmb{z}$ — output.

Run $\mathcal{P}$ once, remember $\pmb{x}, \pmb{z}$. Loop Uniformly select $x_i$. Propose a value for $x_i$. Run $\mathcal{P}$, remember $\pmb{x'}, \pmb{z'}$. Accept ($\pmb{x,z}=\pmb{x',z'}$) or reject with MH probability. Output $\pmb{z}$. end loop

Can we do better?

References

Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
B. Paige, F. Wood, A. Doucet, and Y.W. Teh. Asynchronous anytime sequential Monte Carlo. In NIPS-2014, to appear.
David Wingate, Andreas Stuhlmu ̈ller, and Noah D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proc. of AISTATS-2011.
Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In AISTATS-2014.

Experiments

Convergence —Gaussian Process

$\begin{align*} f\sim&\mathcal{GP}(m,k),\\ \mbox{where }m(x)=&ax^2+bx+c,\quad k(x,x′)=de^{−\frac {(x - x')^2} {2g}}. \end{align*}$

1000 samples
10,000 samples
100,000 samples

Sample size — Kalman Smoother

$\begin{align*} \vec{x}_t &\sim {\rm Norm}(\vec{A} \cdot \vec{x}_{t-1}, \vec{Q}) , & \vec{y}_t &\sim {\rm Norm}(\vec{C} \cdot \vec{x}_{t}, \vec{R}) . \end{align*}$

$\begin{align*} \vec{A} &= \left[ \begin{array}{cc} ~\cos \omega~ & ~-\sin \omega~ \\ ~\sin \omega~ & ~\cos \omega~ \end{array} \right] , & \vec{Q} &= \left[ \begin{array}{cc} ~q~ & 0 \\ 0 & ~q~ \end{array} \right] . \end{align*}$

100 16-dimensional observations,
500 samples after 10,000 samples of burn-in.

Acknowledgments

This material is based on research sponsored by DARPA through the U.S. Air Force Research Laboratory under Cooperative Agreement number FA8750-14-2-0004.

Metropolis Hastings with Adaptive Scheduling

• Selects each $x_i$ with a different probability.
• Maintains vector of weights $\pmb{W}$ of random choices:

Initialize $\pmb{W}^0$ to a constant. Run $\mathcal{P}$ once. for $t = 1 \ldots \infty$ Select $x^t_i$ with probability $\alpha^t_i= {W_i^t} / \sum\limits_{i=1} ^{|\pmb{x}^t|} W_i^t$. Propose a value for $x^t_i$. Run $\mathcal{P}$, accept or reject with MH probability. if accepted Compute $\pmb{W}^{t+1}$ based on the program output. else $\pmb{W}^{t+1} \gets \pmb{W}^{t}$ end if end for

Quantifying the Influence

• Objective: faster convergence of program output $\pmb{z}$.
• Adaptation parameter: probabilities of selecting random choices for modification.
• Optimization target: maximize the change in the program output: $$R^t = \frac 1 {|\pmb{z}^t|}\sum_{k=1}^{|\pmb{z}^t|} \pmb{1}(z_k^t \ne z_k^{t-1}).$$

$W_i$ reflects the anticipated change in $\pmb{z}$ from modifying $x_i$.

Delayed Changes

Modifying x2 affects the output ...

(let [x1 (sample (normal 1 10))
              x2 (sample (normal x1 1))]
          (observe (normal x2 1) 2)
          (predict x1))

... but only when x1 is also modified.

Back-propagating rewards

• For each $x_i$, reward $r_i$ and count $c_i$ are kept.
• A history of modified random choices is attached to every $z_j$.

When modification of $x_k$ accepted:

Append $x_k$ to the history. if $\pmb{z}^{t+1} \ne \pmb{z}^{t}$ $w \gets \frac 1 {|history|}$ for $x_m$ in history $\overline r_m \gets r_m + w,\; c_m \gets c_m + w$ end for Flush the history. else $c_k \gets c_k + 1$ end if

Convergence:

For any partitioning of $\pmb{x}$, Adaptive LMH selects variables from each partition with non-zero probability.

Contributions

• A scheme of rewarding random samples based on program output.
• An approach to propagation of sample rewards to MH proposal scheduling parameters.
• An application of this approach to LMH, where the probabilities of selecting each variable for modification are adjusted.

Future Work

http://www.offtopia.net/nips-pp-ws-2014-ardb-poster/

• Extending the adaptation approach to other sampling methods.
• Reward scheme that takes into account the amount of difference between samples.
• Acquisition of dependencies between predicted expressions and random variables.

Output-Sensitive Adaptive Metropolis-Hastingsfor Probabilistic Programs