Improving PILCO with a Neural Network Gaussian Process

Rishi Edupalli,Sat Apr 05 2025•Machine Learning

Introduction and Overview

PILCO is a model-based reinforcement learning algorithm that uses Gaussian Process inference to build a simulated environment of a robotics control problem and then learns to solve it from scratch. The algorithm can solve non-trivial reinforcement learning problems with limited data and time. Unfortunately, Gaussian Processes do not scale well to more complex, non-smooth dynamics models.

To address this issue, a new variant of PILCO called DeepPILCO was created. DeepPILCO replaced the Gaussian Process in PILCO with a Bayesian Neural Network, which allowed the algorithm to model more complex systems. Unfortunately, a neural network requires more data to train than a Gaussian Process, which takes away from the data efficiency of the algorithm.

Fortunately, a 2017 paper by Lee et al. showed that some deep neural networks converge to a Gaussian Process in the infinite-width limit. An exact relation for this was derived in the form of the Neural Network Gaussian Process and the Neural Tangent Kernel. My research sought to analyze whether replacing PILCO's Gaussian Process with a Neural Network Gaussian Process would yield any performance gains.

A Quick Overview of the PILCO Algorithm

Consider a dynamical system of the form

x_{t+1} = f(x_t, u_t) + \epsilon

where $x_t$ is a vector describing the state of the system at time $t$, $u_t$ is the control signal applied at time $t$, $f$ is an unknown function governing the transition dynamics of the physical system, and $\epsilon$ is random Gaussian noise.

The goal of PILCO is to find the controller $\pi(x_t, \theta) = u_t$, a deterministic policy that is a function of the current state and the parameters $\theta$, that minimizes the sum of the expected costs $c(x_t)$ over the time horizon $T$

J^{\pi}(\theta) = \sum_{t=0}^T \mathbb{E}[c(x_t)].

To learn the function $f$, PILCO uses Gaussian Process regression. A Gaussian Process is defined by its mean and covariance functions. PILCO uses a mean of zero and the Squared Exponential covariance function (kernel), which is defined as

k(\tilde{x},\tilde{x}') = \alpha^2 \exp\left(-\frac{1}{2} (\tilde{x} - \tilde{x}')^T \Lambda^{-1}(\tilde{x} - \tilde{x}')\right).
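As a concrete illustration, the kernel above can be evaluated in a few lines of NumPy. This is a minimal sketch rather than PILCO's own implementation: the function name `se_kernel` and the example values chosen for $\alpha^2$ and $\Lambda$ are assumptions made purely for illustration.

```python
import numpy as np

def se_kernel(X1, X2, alpha2, Lambda):
    """Squared Exponential kernel matrix between rows of X1 (n, d) and X2 (m, d)."""
    Lambda_inv = np.linalg.inv(Lambda)
    diff = X1[:, None, :] - X2[None, :, :]                        # pairwise differences, (n, m, d)
    sq_dist = np.einsum("nmd,de,nme->nm", diff, Lambda_inv, diff)  # (x - x')^T Lambda^{-1} (x - x')
    return alpha2 * np.exp(-0.5 * sq_dist)

# Illustrative inputs (hypothetical values, not taken from the post)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = se_kernel(X, X, alpha2=1.5, Lambda=np.diag([1.0, 4.0]))
```

The diagonal of `K` equals $\alpha^2$, and entries shrink as points move apart relative to the length scales stored in $\Lambda$.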

A GP can be thought of as a probability distribution over the possible functions that could have produced the given data, which in PILCO is initially generated by applying the initial policy. PILCO’s GP uses inputs $\tilde{x} = (x_{t-1}, u_{t-1})$ and outputs

\Delta_t = x_t - x_{t-1} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma_{\epsilon})

to predict the latent function $f$, which in turn is used to predict the difference between states $\Delta_t$ given the previous state $x_{t-1}$ and the control signal $u_{t-1}$. The predictive distribution is Gaussian and gives a one-step prediction of the dynamics of the environment. The values $\alpha^2, \Lambda, \sigma_{\epsilon}$ are hyperparameters, learned via evidence maximization (maximizing the marginal likelihood) in the original PILCO algorithm.
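The one-step prediction described above can be sketched with standard GP regression. The code below is a hedged illustration, not PILCO's implementation: the toy dynamics `true_delta`, the fixed hyperparameter values, and the function names are all assumptions, and the hyperparameters are set by hand instead of being learned.

```python
import numpy as np

def se_kernel(X1, X2, alpha2, lengthscales):
    """SE kernel with Lambda = diag(lengthscales**2)."""
    Z1, Z2 = X1 / lengthscales, X2 / lengthscales
    sq = (Z1**2).sum(1)[:, None] + (Z2**2).sum(1)[None, :] - 2.0 * Z1 @ Z2.T
    return alpha2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def gp_predict(X_train, y_train, X_test, alpha2, lengthscales, noise_var):
    """Posterior mean and variance of the latent function at X_test."""
    K = se_kernel(X_train, X_train, alpha2, lengthscales)
    K += noise_var * np.eye(len(X_train))
    K_star = se_kernel(X_test, X_train, alpha2, lengthscales)
    L = np.linalg.cholesky(K)
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # (K + noise I)^{-1} y
    mean = K_star @ weights
    v = np.linalg.solve(L, K_star.T)
    var = alpha2 - (v**2).sum(axis=0)                            # k(x*, x*) = alpha2
    return mean, var

def true_delta(x, u):
    """Hypothetical 1-D dynamics used only to generate toy data."""
    return 0.1 * (-np.sin(x) + u)

rng = np.random.default_rng(0)
inputs = rng.uniform(-2.0, 2.0, size=(60, 2))                    # columns: state, control
targets = true_delta(inputs[:, 0], inputs[:, 1]) + 0.01 * rng.standard_normal(60)

test_inputs = np.array([[0.5, -0.5], [-1.0, 1.0]])
mean, var = gp_predict(inputs, targets, test_inputs,
                       alpha2=0.1, lengthscales=np.array([1.5, 3.0]), noise_var=1e-4)
```

Here `mean` is the predicted state difference for each test input and `var` is the uncertainty of the latent function at that input. For a multi-dimensional state, one such GP would be trained per output dimension.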

PILCO computes a long-term cost by evaluating these transitions over the time horizon $T$. To do this, the predictive distribution $p(\Delta_t)$ needs to be calculated. Although an exact representation can be given by an integral, it is not analytically tractable, so the distribution is assumed to be Gaussian and approximated via Gaussian moment matching. $p(x_{t+1})$ is then given by

p(x_{t+1}) \approx \mathcal{N}( x_{t+1} | \mu_{t+1}, \Sigma_{t+1})

where

\mu_{t+1} = \mu_t + \mu_{\Delta}

\Sigma_{t+1} = \Sigma_t + \Sigma_{\Delta} + \text{cov}[x_t, \Delta_t] + \text{cov} [\Delta_t, x_t].
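The mean and covariance updates above can be checked numerically on a case where they are exact. The sketch below assumes a hypothetical linear-Gaussian model $\Delta_t = M x_t + w$ with $w \sim \mathcal{N}(0, Q)$; in PILCO itself, $\mu_{\Delta}$, $\Sigma_{\Delta}$, and the cross-covariance come from moment matching through the GP instead. The names `propagate_state`, `M`, and `Q` are illustrative.

```python
import numpy as np

def propagate_state(mu_t, Sigma_t, mu_delta, Sigma_delta, cov_x_delta):
    """Combine state and state-difference moments into next-state moments."""
    mu_next = mu_t + mu_delta
    # cov[Delta_t, x_t] is the transpose of cov[x_t, Delta_t]
    Sigma_next = Sigma_t + Sigma_delta + cov_x_delta + cov_x_delta.T
    return mu_next, Sigma_next

# Hypothetical linear-Gaussian toy model: Delta_t = M x_t + w, w ~ N(0, Q)
M = np.array([[-0.1, 0.05], [0.0, -0.2]])
Q = 0.01 * np.eye(2)
mu_t = np.array([1.0, -0.5])
Sigma_t = np.array([[0.3, 0.1], [0.1, 0.2]])

mu_delta = M @ mu_t
Sigma_delta = M @ Sigma_t @ M.T + Q
cov_x_delta = Sigma_t @ M.T                      # cov[x_t, Delta_t]

mu_next, Sigma_next = propagate_state(mu_t, Sigma_t, mu_delta, Sigma_delta, cov_x_delta)

# Direct computation for x_{t+1} = (I + M) x_t + w, used as a reference
F = np.eye(2) + M
mu_direct = F @ mu_t
Sigma_direct = F @ Sigma_t @ F.T + Q
```

For this linear model the two routes agree exactly, which shows why the cross-covariance terms cannot be dropped: the state and its predicted change are correlated.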

The total cost is then given by summing, over $t = 0, 1, \ldots, T$, the expected costs

\mathbb{E}[ c(x_t) ] = \int c(x_t) \mathcal{N}(x_t | \mu_t, \Sigma_t) dx_t

for a suitable analytically tractable cost function $c(x_t)$.
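One cost for which this integral has a closed form is a saturating cost $c(x) = 1 - \exp(-\frac{1}{2}(x - x_{\text{target}})^T W (x - x_{\text{target}}))$. The sketch below assumes that cost form (the post does not specify one) and compares the closed-form expectation against a Monte Carlo estimate. The names `expected_saturating_cost`, `W`, and `target`, and all numeric values, are illustrative.

```python
import numpy as np

def expected_saturating_cost(mu, Sigma, target, W):
    """E[1 - exp(-0.5 (x - target)^T W (x - target))] for x ~ N(mu, Sigma)."""
    d = len(mu)
    S = np.eye(d) + Sigma @ W
    diff = mu - target
    quad = diff @ W @ np.linalg.solve(S, diff)    # diff^T W (I + Sigma W)^{-1} diff
    return 1.0 - np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(S))

# Illustrative state distribution and cost parameters (hypothetical values)
mu = np.array([0.5, -0.2])
Sigma = np.array([[0.2, 0.05], [0.05, 0.1]])
target = np.zeros(2)
W = np.diag([4.0, 1.0])

closed_form = expected_saturating_cost(mu, Sigma, target, W)

# Monte Carlo estimate of the same integral
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
d = samples - target
costs = 1.0 - np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, W, d))
monte_carlo = costs.mean()
```

Because the expectation is available in closed form, it can be differentiated with respect to $\mu_t$ and $\Sigma_t$, which is what makes the cost useful for gradient-based policy search.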

By minimizing $J^{\pi}(\theta)$ with respect to the policy parameters $\theta$, the policy is improved until it converges to a good, but not necessarily optimal, policy $\pi^{*}$.

Neural Network Gaussian Process Modifications to PILCO

Theoretically speaking, in a very general sense, a simple Neural Network $f$ can be defined as

f(x) = A \phi (Bx)

where the matrices $A$, $B$ are the parameters of the Neural Network and $\phi$ is an element-wise activation function. The goal of the Neural Network is then to learn a relation between the input data $x$ and the outputs $y$ by minimizing the loss

L(A,B) = \frac{1}{2} \sum_{i=1}^n \left(y^{(i)} - f(x^{(i)})\right)^2

given by the sum of squared differences between the $i = 1, 2, \ldots, n$ training targets $y^{(i)}$ and the outputs of the Neural Network $f(x^{(i)})$. The loss can be minimized via gradient descent with step size $\alpha$ (unrelated to the kernel amplitude $\alpha^2$ above)

A^{(t+1)} = A^{(t)} - \alpha \frac{\partial L}{\partial A}

B^{(t+1)} = B^{(t)} - \alpha \frac{\partial L}{\partial B}.
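The update rules above can be written out for the two-layer network $f(x) = A \phi(Bx)$. This is a minimal sketch under stated assumptions: the activation is taken to be `tanh`, the data is a toy 1-D regression problem, and the width, step size, and number of steps are arbitrary illustrative choices.

```python
import numpy as np

def forward(A, B, X):
    H = np.tanh(X @ B.T)                       # hidden features phi(Bx), shape (n, h)
    return H @ A.T, H                          # outputs f(x), shape (n, o)

def loss_and_grads(A, B, X, Y):
    pred, H = forward(A, B, X)
    R = pred - Y                               # residuals f(x) - y
    loss = 0.5 * np.sum(R**2)
    grad_A = R.T @ H                           # dL/dA, shape (o, h)
    grad_B = ((R @ A) * (1.0 - H**2)).T @ X    # dL/dB via the chain rule, shape (h, d)
    return loss, grad_A, grad_B

# Toy data (hypothetical): learn y = sin(x) on [-2, 2]
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 1))
Y = np.sin(X)

h = 10
A = rng.standard_normal((1, h)) / np.sqrt(h)
B = rng.standard_normal((h, 1))

step_size = 1e-3
initial_loss, _, _ = loss_and_grads(A, B, X, Y)
for _ in range(2000):
    loss, grad_A, grad_B = loss_and_grads(A, B, X, Y)
    A = A - step_size * grad_A                 # A^{(t+1)} = A^{(t)} - alpha dL/dA
    B = B - step_size * grad_B                 # B^{(t+1)} = B^{(t)} - alpha dL/dB
final_loss, _, _ = loss_and_grads(A, B, X, Y)
```

Both parameter matrices are updated simultaneously from gradients evaluated at the current iterate, matching the two update equations.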

When $B$ is frozen and only $A$ is trained, training the Neural Network under gradient descent is equivalent to kernel regression, where $\phi(Bx)$ acts as the feature map. With a finite number of features, this is equivalent to linear regression on $\phi(Bx)$.
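The link between the frozen-$B$ network and kernel regression can be seen numerically. The sketch below fits the output weights in two ways, as linear regression on the features $\phi(Bx)$ and as kernel regression with $k(x, x') = \phi(Bx) \cdot \phi(Bx')$, and the predictions coincide. It assumes a `tanh` activation and adds a small ridge term for numerical stability, which the text does not mention; it solves for the weights directly instead of running gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(20, 1))
y = np.sin(X[:, 0])                             # toy targets (hypothetical)
X_test = np.linspace(-2.0, 2.0, 5)[:, None]

h = 50
B = rng.standard_normal((h, 1))                 # frozen random first layer

def features(X):
    return np.tanh(X @ B.T)                     # phi(Bx), shape (n, h)

ridge = 1e-2                                    # small regularizer (an assumption)
H, H_test = features(X), features(X_test)

# Primal view: ridge linear regression for the output weights A
A = np.linalg.solve(H.T @ H + ridge * np.eye(h), H.T @ y)
pred_primal = H_test @ A

# Dual view: kernel ridge regression with k(x, x') = phi(Bx) . phi(Bx')
K = H @ H.T
K_test = H_test @ H.T
pred_dual = K_test @ np.linalg.solve(K + ridge * np.eye(len(X)), y)
```

The primal form works with an $h \times h$ system and the dual form with an $n \times n$ system, so the dual view stays usable as the width $h$ grows.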

Furthermore, when $B$ is sampled from $\mathcal{N}(0, I)$ and the dimensions of $A$ and $B$ (the width of the hidden layer) tend towards infinity, the Neural Network converges to a Gaussian Process. When this infinite limit has a closed form, it is known as a Neural Network Gaussian Process (NNGP). The NNGP has a kernel that depends on the architecture of the neural network, and it was intended to act as a drop-in replacement for the Gaussian Process in PILCO.
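To make the infinite-width limit concrete, the sketch below compares a closed-form kernel with its finite-width estimate for one specific architecture. It assumes a single hidden layer with ReLU activation, rows of $B$ drawn from $\mathcal{N}(0, I)$, and output weights with variance $1/\text{width}$; under those assumptions the covariance of the outputs approaches $\mathbb{E}_b[\phi(b \cdot x)\phi(b \cdot x')]$, which for ReLU is the degree-1 arc-cosine kernel. The architecture and function names are illustrative and not necessarily the ones used in the project.

```python
import numpy as np

def relu_nngp_kernel(x, y):
    """Closed-form E_b[relu(b.x) relu(b.y)] for b ~ N(0, I)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_theta = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def empirical_kernel(x, y, width, rng):
    """Monte Carlo estimate of the same expectation using `width` random hidden units."""
    B = rng.standard_normal((width, len(x)))
    return np.mean(np.maximum(B @ x, 0.0) * np.maximum(B @ y, 0.0))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])
y = np.array([-0.3, 1.2])

exact = relu_nngp_kernel(x, y)
approx = empirical_kernel(x, y, width=400_000, rng=rng)
```

As the width grows, `approx` settles toward `exact`. Unlike the Squared Exponential kernel, this kernel is not stationary: it depends on the norms of the inputs and the angle between them, not just on their difference.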

Conclusion

The results of my project are detailed in my paper.