I attended DARPA's Probabilistic Programming for Advancing Machine Learning (PPAML) summer school. The aim of probabilistic programming languages (PPLs) is to abstract the act of Bayesian inference into modular engines, such that switching from, say, Hamiltonian Monte Carlo to a particle filter requires changing exactly one string. If you can write a model in a PPL, you get inference for free [1].
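To make the "change exactly one string" idea concrete, here is a toy sketch of my own (not any real PPL's API — `infer`, `model_coin`, and the engine names are all hypothetical): two interchangeable inference routines estimate the same coin-bias posterior, and switching engines means changing one string.

```python
import random

def model_coin(theta, data):
    """Likelihood of binary data under coin bias theta (Bernoulli model)."""
    p = 1.0
    for x in data:
        p *= theta if x else (1.0 - theta)
    return p

def grid_posterior_mean(likelihood, data, n=1000):
    """Engine 1: grid approximation with a uniform prior over theta."""
    grid = [(i + 0.5) / n for i in range(n)]
    weights = [likelihood(t, data) for t in grid]
    z = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / z

def importance_posterior_mean(likelihood, data, n=10000, seed=0):
    """Engine 2: self-normalized importance sampling, uniform proposal."""
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n)]
    weights = [likelihood(t, data) for t in samples]
    z = sum(weights)
    return sum(t * w for t, w in zip(samples, weights)) / z

ENGINES = {"grid": grid_posterior_mean, "importance": importance_posterior_mean}

def infer(method, likelihood, data):
    """Swapping inference engines means changing exactly one string."""
    return ENGINES[method](likelihood, data)

data = [1, 1, 1, 0, 1, 1, 0, 1]            # 6 heads, 2 tails
m1 = infer("grid", model_coin, data)        # ~0.7, the Beta(7, 3) posterior mean
m2 = infer("importance", model_coin, data)  # ~0.7 as well
```

Real PPLs do this at the language level rather than with a dispatch table, but the contract is the same: the model is written once, and the engine is a pluggable detail.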
The purported aim is to allow machine learning code that today requires 1000-10,000 lines of code to be written in 10-100 lines.
This works brilliantly for one-shot learning. Say you are trying to teach a computer to recognize a handwritten character from a single example. First, you build a generative model that mirrors how letters are constructed: the hand touches paper and makes a primitive shape (line, curve, loop, etc.), and multiple such primitives are strung together to form a character. For each example character, infer the most likely sequence of primitives. When a new classification is requested, take samples from the generative model for each known character, calculate the differences in pixel values, and repeat this hundreds of thousands of times. You can then construct an accurate marginal probability for each known character while needing only a single example.
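A drastically simplified sketch of that pipeline, with hand-written stroke "programs" standing in for the inferred ones (the 5x5 grid, the jitter model, and all names are my own toy assumptions, not the actual system):

```python
import math
import random

SIZE = 5  # toy 5x5 binary pixel grid

def render(strokes, rng, jitter=0.1):
    """Generative model: draw each stroke primitive onto the grid; with
    probability `jitter`, shift it by one cell to simulate hand noise."""
    grid = [[0] * SIZE for _ in range(SIZE)]
    for kind, pos in strokes:  # ("h", row) or ("v", col)
        offset = rng.choice([-1, 1]) if rng.random() < jitter else 0
        p = min(SIZE - 1, max(0, pos + offset))
        for i in range(SIZE):
            r, c = (p, i) if kind == "h" else (i, p)
            grid[r][c] = 1
    return grid

def pixel_distance(a, b):
    return sum(a[r][c] != b[r][c] for r in range(SIZE) for c in range(SIZE))

# One stroke "program" per known character. Here they are written by hand;
# the real approach infers these sequences from a single example each.
PROGRAMS = {
    "T": [("h", 0), ("v", 2)],
    "L": [("v", 0), ("h", 4)],
    "+": [("h", 2), ("v", 2)],
}

def classify(image, n_samples=2000, seed=0):
    """Approximate each character's marginal probability by repeatedly
    sampling from its generative model and comparing pixel values."""
    rng = random.Random(seed)
    scores = {}
    for char, prog in PROGRAMS.items():
        total = 0.0
        for _ in range(n_samples):
            total += math.exp(-pixel_distance(render(prog, rng), image))
        scores[char] = total / n_samples
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

rng = random.Random(42)
query = render(PROGRAMS["T"], rng)  # a noisy drawing of "T"
probs = classify(query)
best = max(probs, key=probs.get)    # should be "T" with high probability
```

The real work, of course, is in inferring the stroke programs from a single example — that is where the probabilistic program and its inference engine earn their keep.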
Very interesting. I haven't read much about the topic (yet), so excuse my pre-morning-coffee naive comment. I know very little about languages like Stan, but what's the reasoning behind baking these "inference for free" ideas into a language? It sounds like something for a library. Are there distinct advantages to having a new language for these things?
My very naive approach would be using something like the pipelines in Phoenix (pipe_through :hamiltonian)
I suppose the reasoning is the same as for Prolog?
Stan (http://mc-stan.org/documentation/) is arguably the most advanced language. It's especially pushing the bounds of automatic variational inference, for the scenario where your model does not have a nice conjugate form amenable to Gibbs sampling. It hasn't quite reached what I would call production quality, but some of the best people in the world of computational Bayesian methods (e.g., Michael Betancourt, most of David Blei's lab, etc.) are working on it.
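For intuition about what conjugacy buys you: when every full conditional has a known closed form, Gibbs sampling is just alternating exact draws. A minimal stdlib-only sketch for a normal model with unknown mean and variance (an illustration of the technique, not Stan's machinery; the priors and names are my choices):

```python
import math
import random

def gibbs_normal(data, iters=5000, seed=1):
    """Gibbs sampler for x_i ~ Normal(mu, sigma2) with a flat prior on mu and
    a Jeffreys prior on sigma2; both full conditionals are conjugate."""
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    sigma2 = 1.0
    mus = []
    for _ in range(iters):
        # mu | sigma2, data  ~  Normal(xbar, sigma2 / n)
        mu = rng.gauss(xbar, math.sqrt(sigma2 / n))
        # sigma2 | mu, data  ~  InverseGamma(n/2, S/2), S = sum (x_i - mu)^2
        s = sum((x - mu) ** 2 for x in data)
        sigma2 = 1.0 / rng.gammavariate(n / 2.0, 2.0 / s)
        mus.append(mu)
    return mus

rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(50)]
mus = gibbs_normal(data)
post_mean = sum(mus[500:]) / len(mus[500:])  # posterior mean of mu after burn-in
```

When the model lacks such closed-form conditionals, this recipe is unavailable, which is exactly when Stan's HMC/NUTS and variational methods become attractive.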
Yes, Stan is awesome! The main difference between something like Stan and "next gen" languages like Anglican and WebPPL is that the latter place basically no restrictions on where you can use a distribution: nested inference, probabilistic recursion, etc. are all possible. For certain classes of problems this leads to greatly enhanced expressiveness. On the flip side, Stan is more production-ready right now.
A major downside of Stan is its lack of support for discrete priors. This isn't advertised very well, but it is more of a problem than it might initially sound. Its type handling can also get a little frustrating at times. Overall, I highly recommend it, but it does have its downsides, and there's some room for alternatives or improvement.
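The standard workaround for the missing discrete parameters is to marginalize the discrete variable out of the likelihood by hand, usually via log-sum-exp. A toy sketch for the component indicator of a Gaussian mixture (function names are mine):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def mixture_loglik(x, weights, mus, sigma=1.0):
    """log p(x) with the discrete indicator z summed out:
    log p(x) = log sum_z p(z) * p(x | z)."""
    return log_sum_exp([
        math.log(w) + normal_logpdf(x, mu, sigma)
        for w, mu in zip(weights, mus)
    ])

# Two equally weighted components centered at -2 and +2.
ll = mixture_loglik(2.0, [0.5, 0.5], [-2.0, 2.0])
```

This is mechanical but easy to get wrong, which is why languages that support discrete latent variables natively can be a real convenience.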
PyMC3 (http://pymc-devs.github.io/pymc3/) has all the powerful samplers that Stan has (i.e. NUTS) as well as support for discrete priors. It allows you to specify your models in pure Python, supports matrix algebra, and also has variational inference, which allows for large-ish scale machine learning (e.g. here is a blog post where I train a Bayesian neural net on MNIST http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learn...). Under the hood, it uses Theano for JIT compilation to C or the GPU.
Anglican for high performance (Clojure host, strong multithreading support) and WebPPL for ease of integration (JavaScript client side or server side). Both are implemented using continuation passing style trampolines.
You don't need to know what that is to use the languages :). In essence, every function takes an extra argument, a "continuation": a function that receives the value that would otherwise have been returned. A trampoline is then a loop that repeatedly invokes these suspended computations, so the native call stack never grows. CPS is typically a compilation technique rather than a style you write by hand.
Powerful stuff!
[1] Efficiency not guaranteed... (yet).
Edit: Here are a few curated resources:
http://webppl.org/
http://www.robots.ox.ac.uk/~fwood/anglican/
http://mc-stan.org/
http://dippl.org/
https://probmods.org/