Improving PILCO with a Neural Network Gaussian Process

Rishi Edupalli,Machine Learning

Introduction and Overview

PILCO (opens in a new tab) is a Model-based reinforcement learning alorithm that utilizes Gaussian Process inference to create a simulated environment of a robotics control problem to learn how to solve it from scratch. The algorithm is able to solve non-trivial reinforcement learning problems from scratch with limited data and time. Unfortunately, Gaussian Processes are not the greatest at scaling up to more complex non smooth dynamics models.

To fix this issue, a new varient of PILCO called DeepPILCO was created. DeepPILCO (opens in a new tab) replaced the Gaussian Process in PILCO with a Bayesian Neural Network which allowed the algorithm to model more complex systems. Unfortunately, having a neural network instead of a Gaussian Process requires more data to train and takes away form the efficiency of the algorithm.

Fortunately, a 2017 paper by Lee et al. (opens in a new tab) discovered that some deep neural networks actually converge to a Gaussian process in an infinite limit. An exact relation for this was derived in the form of a Neural Network Gaussian Process and the Neural Tangent Kernel. My research sought to analyze if there would be any performance gains by replacing PILCO's Gaussian Process with a Neural Network Gaussian Process.

A Quick Overview of the PILCO Algorithm

Consider a dynamical system of the form

xt+1=f(xt,ut+ϵ)x_{t+1} = f(x_t,u_t + \epsilon)

where xtx_t is a vector describing the state of the system at time tt, utu_t is the control signal applied at time tt, ff is an unknown function governing the transitional dynamics of the physical system, and ϵ\epsilon is a random gaussian noise.

The goal of PILCO is to find the cotroller π(x,θ)=ut\pi(x,\theta) = u_t, a deterministic policy which is a function of the current state and parameters θ\theta that minimizes the sum over the average cost c(xt)c(x_t) over time TT

Jπ(θ)=t=0TE[c(xt)].J^{\pi}(\theta) = \sum_{t=0}^T \mathbb{E}[c(x_t)].

To learn the function ff, PILCO uses Gaussian Process regression. A Gaussian Process is defined by its mean and covariance functions. PILCO utilizes a mean of zero and the Squared Exponential covariance function (kernel) which is defined as

k(x~,x~)=α2e12(x~x~)TΛ1(x~x~).k(\tilde{x},\tilde{x}’) = \alpha^2 e^{-\frac{1}{2} (\tilde{x} - \tilde{x`})^T \Lambda^{-1}(\tilde{x} - \tilde{x}’)}.

A GP can be thought of as a probability distribution over the possible functions that can approximate the initial givern data which will be generated by applying the initial policy in PILCO. PILCO’s GP uses inputs (xt1,ut1)(x_{t-1}, u_{t-1}) and outputs

Δt=xtxt1+ϵ,ϵN(0,σϵ)\Delta t = x_t - x_{t-1} + \epsilon, \epsilon \sim \mathcal{N}(0, \sigma_{\epsilon})

to predict the latent function ff which in turn is used to predict the difference between states Δt\Delta t given the control signal utu_t and the previous state xtx_t. This probability distribution is assumed to be Gaussian and gives a one-step prediction of the dynamics of environment. The values α2,Λ,σϵ\alpha^2, \Lambda, \sigma_{\epsilon} are hyperparameters learned via Expectation Maximization in the original PILCO algorithm.

PILCO creates a long term cost by evalauting these transitions over time TT. To do this, the predictive distribution p(Δt)p(\Delta t) needs to be calculated. Although an exact representation can be given by an integral, is is not analytically tractable so it is assumed to be Gaussian and then approximated via Gaussian Moment Matching. p(xt+1)p(x_{t+1}) is then given by

p(xt+1)N(xt+1μt+1,Σt+1) p(x_{t+1}) \approx \mathcal{N}( x_{t+1} | \mu_{t+1}, \Sigma_{t+1})

where

μt+1=μt+μΔ\mu_{t+1} = \mu_t + \mu_{\Delta} Σt+1=Σt+ΣΔ+cov[xt,Δt]+cov[Δt,xt].\Sigma_{t+1} = \Sigma_t + \Sigma_{\Delta} + \text{cov}[x_t, \Delta_t] + \text{cov} [\Delta_t, x_t].

The total cost than given by summing over t=0,1,...Tt = 0,1, ... T

E[c(xt)]=c(xt)N(xtμt,Σt)dt\mathbb{E}[ c(x_t) ] = \int c(x_t) \mathcal{N}(x_t | \mu_t, \Sigma_t) dt

for a suitable analytically tractable cost function c(xt)c(x_t).

By minimizing p(xt,θ)p(x_t, \theta) with respect to the cost function JπJ^{\pi}, the parameters of θ\theta are learned until convergence to a good but not necessarily optimal policy π\pi^{*}.

Neural Network Gaussian Process modifications to PILCO

Theoretically speaking, in a very general sense, a simple Neural Network ff can be defined as

f(x)=Aϕ(Bx)f(x) = A \phi (Bx)

where the matrices AA, BB are the parameters of the Neural Network and ϕ\phi is an element-wise activation function. The goal of the Neural Network is to then learn a relation between the input data xx and the outputs yy by minimizing the loss

L(A,B)=12i=0n(y(i)f(x(i))L(A,B) = \frac{1}{2} \sum_{i=0}^n (y^{(i)} - f(x^{(i)})

given by the sum of the difference of the i=1,2,,ni = 1 , 2, \ldots, n training samples y(i)y^{(i)} and outputs of the Neural Network f(x(i))f(x^{(i)}) which can be found via gradient descent

At+1=A(t)αLAA^{t+1} = A^{(t)} - \alpha \frac{\partial L}{\partial A} Bt+1=B(t)αLBB^{t+1} = B^{(t)} - \alpha \frac{\partial L}{\partial B}

When the training of BB is frozen, training a Neural Network under gradient descent is equivalent to kernel regression where ϕ(Bx)\phi(Bx) acts as the map. In this low dimensional case, this is equivalent to linear regression.

Furthermore, when BB is sampled from N(0,I)N(0, I) and the dimensions of AA and BB tend towards infinity, our Neural Network converges to a Gaussian Process. When this infinite limit has a closed form, it is known as a Neural Network Gaussian Process (NNGP). This NNGP has a kernel that depends on the architecture of the neural network and was to act as drop in replacement for PILCO.

Conclusion

The results of my project are detailed in my paper (opens in a new tab)