Deployable probabilistic programming

David Tolpin, PUB+


Probabilistic Programming

Programs as statistical models:

var breastCancer = flip(0.01)
var benignCyst = flip(0.2)
var positiveMammogram = (breastCancer && flip(0.8))
                      || (benignCyst && flip(0.5))
condition(positiveMammogram)
return breastCancer
The example computes the probability of breast cancer given a positive mammogram.

Probabilistic Programs

  • Compute posterior distributions.
  • May contain conditions, loops, recursions.
  • Are (mostly) deterministic!
  • Must be run 'backwards'.
  • (Approximate) inference algorithms are required.

Probabilistic programming languages

  • 45 PPLs listed on Wikipedia.
  • 18 PPLs participated in the developer meetup at PROBPROG 2018.
  • 8 of the latter are not on Wikipedia (yet).

Probabilistic programming

Two polar views:

  1. Framework for statistical modelling
  2. Inferentiable programming — programs with inferrable parameters

Challenges

  • Simulation vs. inference
  • Data
  • Deployment

Simulation vs. inference

A Stan model specifies the log density for inference; simulating data from the same model requires writing separate code.

data {
    int N;
    vector[N] y;
    vector[N] x;
}

parameters {
    real alpha; real beta;
    real sigma;
}
model {
    alpha ~ normal(0,10);    
    beta ~ normal(0,10);
    sigma ~ cauchy(0,5);

    for (n in 1:N)
        y[n] ~ normal(
          alpha + beta * x[n],
          sigma);
}

Data structures

I mean:

def update_beliefs(beliefs, i, j, bandwidth):
    beliefs[i][j] += 1
    evidence = sum(beliefs[i])
    if evidence > bandwidth:
        for j in range(len(beliefs[i])):
            beliefs[i][j] *= bandwidth / evidence
      
I write:

import torch

def update_beliefs(beliefs, i, j, bandwidth):
    update = torch.zeros(beliefs.shape)
    update[i, j] = 1
    beliefs = beliefs + update
    evidence = beliefs[i, :].sum()
    if evidence > bandwidth:
        scale = torch.ones(beliefs.shape)
        scale[i, :] = bandwidth / evidence
        beliefs = beliefs * scale
    return beliefs
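
For contrast, a sketch of the same update in plain Go, with ordinary loops over ordinary slices (illustrative only, not any specific Infergo API):

// updateBeliefs increments one belief count and rescales the row
// whenever its total evidence exceeds the bandwidth.
func updateBeliefs(beliefs [][]float64, i, j int, bandwidth float64) {
  beliefs[i][j]++
  evidence := 0.0
  for _, b := range beliefs[i] {
    evidence += b
  }
  if evidence > bandwidth {
    for k := range beliefs[i] {
      beliefs[i][k] *= bandwidth / evidence
    }
  }
}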
            

Deployment

PyStan (70MB): Stan, C++ compiler, Cython, OCaml?

Pyro (>600MB):
Successfully installed contextlib2-0.5.5
decorator-4.3.0 graphviz-0.10.1 networkx-2.2
opt-einsum-2.3.2 pyro-ppl-0.3.0 six-1.12.0
torch-1.0.0 tqdm-4.29.1
Turing.jl (50MB):
Installed NNlib ─────────── v0.4.3
Installed Colors ────────── v0.9.5
Installed Arpack ────────── v0.3.0
Installed BinaryProvider ── v0.5.3
Installed ForwardDiff ───── v0.10.1
Installed ColorTypes ────── v0.7.5
Installed ProgressMeter ─── v0.9.0
Installed ZipFile ───────── v0.8.0
Installed StaticArrays ──── v0.10.2
Installed Juno ──────────── v0.5.3
... (~40 packages)

Guidelines

  • One language
  • Common data structures
  • Inference code re-used in simulation

Probabilistic programming in Go

Go is

  • small,
  • expressive,
  • efficient,
  • popular for server-side programming.

'hello world': model


type Model struct {
  Data []float64
}

// x[0] is the mean, x[1] is the
// log stddev of the distribution
func (m *Model) Observe(x []float64) float64 {
  // Our prior is a unit normal ...
  ll := Normal.Logps(0, 1, x...)
  // ... but the posterior is based on data observations.
  ll += Normal.Logps(x[0], math.Exp(x[1]), m.Data...)
  return ll
}
            

'hello world': data

m := &Model{[]float64{
  -0.854, 1.067, -1.220, 0.818, -0.749,
  0.805, 1.443, 1.069, 1.426, 0.308}}
            

'hello world': optimization


x := []float64{0, 0}

opt := &infer.Momentum{
  Rate:  0.01,
  Decay: 0.998,
}
for iter := 0; iter != 1000; iter++ {
  opt.Step(m, x)
}
mean, logs := x[0], x[1]
            

'hello world': posterior


x := []float64{0, 0}

hmc := &infer.HMC{
  Eps: 0.1,
}
samples := make(chan []float64)
hmc.Sample(m, x, samples)
for i := 0; i != 1000; i++ {
  x = <-samples
}
hmc.Stop()
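
The loop above keeps only the last draw. A sketch (illustrative, not from the original slides) of averaging the draws into a posterior-mean estimate instead:

// Average the draws to estimate the posterior mean of each parameter.
n := 1000
mean := make([]float64, len(x))
for i := 0; i != n; i++ {
  x = <-samples
  for j := range x {
    mean[j] += x[j] / float64(n)
  }
}
hmc.Stop()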
            

'hello world': streaming


type Model struct {
  Data chan float64 // data is a channel
  N    int          // batch size
}

func (m *Model) Observe(x []float64) float64 {
  ll := Normal.Logps(0, 1, x...)
  // observe a batch of data from the channel
  for i := 0; i != m.N; i++ {
    ll += Normal.Logp(x[0], math.Exp(x[1]), <-m.Data)
  }
  return ll
}
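
A sketch of driving the streaming model: a producer goroutine feeds the channel while inference runs in the main goroutine (observations is an assumed data source; the optimizer is the Momentum shown earlier):

m := &Model{Data: make(chan float64), N: 10}

// Feed observations into the model concurrently,
// e.g. from a network stream.
go func() {
  for _, y := range observations { // observations: assumed []float64
    m.Data <- y
  }
}()

x := []float64{0, 0}
opt := &infer.Momentum{Rate: 0.01, Decay: 0.998}
// Each Step consumes m.N values from the channel.
for iter := 0; iter != 100; iter++ {
  opt.Step(m, x)
}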
            

Why Go?

  • Comes with a parser and type checker.
  • Compiles and runs fast.
  • Allows efficient parallel execution via goroutines.

Infergo

  • Models are written in Go.
  • Relies on automatic differentiation for inference.
  • Works anywhere Go does.
  • No external dependencies.
  • Licensed under the MIT license.

Model

Model interface:

type Model interface {
  Observe(parameters []float64) float64
}
            
A model (exponential distribution):

type Expon float64

func (m Expon) Observe(x []float64) float64 {
  return -x[0] * float64(m)
}
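
A usage sketch (values illustrative): the float64-based type satisfies the Model interface, so it can be passed to anything that accepts a Model:

m := Expon(2.0)                 // rate
ll := m.Observe([]float64{0.5}) // log density at 0.5, up to a constant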
            

Distributions

A distribution is a model:

var Normal normal

// Observe implements the Model interface.
func (dist normal) Observe(x []float64) float64 {
  mu, sigma, y := x[0], x[1], x[2:]
  return dist.Logps(mu, sigma, y...)
}
            

Distributions (cont.)


// Logp computes the log pdf of a single observation.
func (_ normal) Logp(mu, sigma float64, y float64) float64 {
  ...
}

// Logps computes the log pdf of a vector of observations.
func (_ normal) Logps(mu, sigma float64, y ...float64) float64 {
  ...
}
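
A usage sketch of the two forms (values illustrative):

ll := Normal.Logp(0, 1, 0.5)              // a single observation
lls := Normal.Logps(0, 1, 0.5, -0.3, 1.2) // several observations at once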
        

Differentiation

  • Model methods returning float64 or nothing are differentiated.
  • Within the methods, the following is differentiated:
    • assignments to float64;
    • returns of float64;
    • standalone calls to differentiated model methods.

Differentiation (cont.)

Reverse-mode via source code transformation:

func (m *Model) Observe(x []float64) float64 {
  var ll float64
  ad.Assignment(&ll, ad.Call(func(_ []float64) {
    Normal.Logps(0, 0, x...)
  }, 2, ad.Value(0), ad.Value(1)))
  ad.Assignment(&ll,
    ad.Arithmetic(ad.OpAdd, &ll,
      ad.Call(func(_ []float64) {
        Normal.Logps(0, 0, m.Data...)
      }, 2, &x[0], ad.Elemental(math.Exp, &x[1]))))
  return ad.Return(&ll)
}
            

Inference

  • Optimization via gradient ascent (Momentum, Adam); works with streamed and stochastic data.
  • Full posterior inference via HMC variants. Samples are produced concurrently and passed through a channel.
  • Third-party inference algorithms.

Third-party inference

  • Straightforward because Infergo uses built-in Go float64.
  • Gonum (http://gonum.org/) library for numerical computation:
    
    func FuncGrad(m Model) (
      Func func(x []float64) float64,
      Grad func(grad []float64, x []float64))
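
A sketch of the Gonum side, assuming m is a model and x an initial parameter vector (Gonum minimizes, so depending on sign conventions the objective is the negated log-likelihood):

// import "gonum.org/v1/gonum/optimize"
Func, Grad := infer.FuncGrad(m)
p := optimize.Problem{Func: Func, Grad: Grad}

// Minimize starting from x; result.X holds the optimum.
result, err := optimize.Minimize(p, x, nil, nil)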
                    

Examples

  • Simple models are still simple.
  • Complex models are less complex.

8 schools

Stan:

data {
    int<lower=0> J;
    vector[J] y;
    vector<lower=0>[J] sigma;
}

parameters {
    real mu;
    real<lower=0> tau;
    vector[J] eta;
}

transformed parameters {
    vector[J] theta;
    theta = mu + tau * eta;
}

model {
    eta ~ normal(0, 1);
    y ~ normal(theta, sigma);
}

            

Infergo:

type Model struct {
  J          int
  Y          []float64
  Sigma      []float64
}

func (m *Model) Observe(x []float64) float64 {
  mu := x[0]
  tau := math.Exp(x[1])
  eta := x[2:]

  ll := Normal.Logps(0, 1, eta...)
  for i, y := range m.Y {
    theta := mu + tau*eta[i]
    ll += Normal.Logp(theta, m.Sigma[i], y)
  }
  return ll
}
            

Linear regression


type Model struct {
  Data  [][]float64
}

func (m *Model) Observe(x []float64) float64 {
  ll := 0.

  alpha, beta := x[0], x[1]
  sigma := math.Exp(x[2])

  for i := range m.Data {
    ll += Normal.Logp(
      m.Simulate(m.Data[i][0], alpha, beta),
      sigma, m.Data[i][1])
  }
  return ll
}

// Simulate predicts y for x based on
// inferred parameters.
func (m *Model) Simulate(x, alpha, beta float64) float64 {
  y := alpha + beta*x
  return y
}
            

Latent Dirichlet allocation

Infergo (0.5s | 8.9s):

type LDAModel struct {
  K     int       // num topics
  V     int       // num words
  M     int       // num docs
  N     int       // total word instances
  Word  []int     // word n
  Doc   []int     // doc for word n
  Alpha []float64 // topic prior
  Beta  []float64 // word prior
}

func (m *LDAModel) Observe(x []float64) float64 {
  ll := Normal.Logps(0, 1, x...)
  theta := make([][]float64, m.M)
  D.Simplices(&x, m.K, theta)
  phi := make([][]float64, m.K)
  D.Simplices(&x, m.V, phi)
 
  // priors
  ll += Dirichlet{m.K}.Logps(m.Alpha, theta...)
  ll += Dirichlet{m.V}.Logps(m.Beta, phi...)
  
  gamma := make([]float64, m.K)
  for in := 0; in != m.N; in++ {
    for ik := 0; ik != m.K; ik++ {
      gamma[ik] = math.Log(theta[m.Doc[in]-1][ik]) +
        math.Log(phi[ik][m.Word[in]-1])
    }
    ll += D.LogSumExp(gamma)
  }
  return ll
}

            

Stan (54s | 3.7s):

data {
  int K;           // num topics
  int V;           // num words
  int M;           // num docs
  int N;           // total word instances
  int w[N];        // word n
  int doc[N];      // doc for word n
  vector[K] alpha; // topic prior
  vector[V] beta;  // word prior
}


parameters {
  simplex[K] theta[M]; // topic dist for doc m
  simplex[V] phi[K];   // word dist for topic k
}


model {
  for (m in 1:M)  
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)  
    phi[k] ~ dirichlet(beta);     // prior

  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K) 
      gamma[k] <- log(theta[doc[n],k]) 
        + log(phi[k,w[n]]);
    increment_log_prob(log_sum_exp(gamma));
  }
}
            

Acknowledgements

  • Frank Wood introduced me to probabilistic programming.
  • Jan-Willem van de Meent discussed with me motives, ideas, and implementation choices behind Infergo.
  • PUB+ supported me in the development of Infergo.

Thank you!

(image: Gehenna Gopher)