← Back to blog index

Problem statement

Numerical optimisation is efficient at refinement but fails to discover control policies or behaviours far from the point of initialisation. Sampling can discover further local optima through exploration with simpler cost functions.

Aim. Combine inference and second-order trajectory optimization so the method can efficiently sample when no derivative information is available and follow optimized trajectories when derivatives arise.

Numerical solution

The method relies on the Bellman equation and its tangent-space solution.

\[ v\left(\mathbf{x}, t\right) = \min_{\mathbf{u}}\left[\ell\left(\mathbf{x},\mathbf{u},t\right) + v\left(\mathbf{f}\left(\mathbf{x},\mathbf{u},t\right)\right)\right] \equiv Q\left(\mathbf{x},\mathbf{u},t\right) \]

Solving in tangent space:

\[ Q\left(\delta\mathbf{x},\delta\mathbf{u}\right) \approx \frac{1}{2} \begin{bmatrix}1 \\ \delta\mathbf{x} \\ \delta\mathbf{u}\end{bmatrix}^{\!\top} \begin{bmatrix} 0 & Q_{\mathbf{x}}^{\top} & Q_{\mathbf{u}}^{\top} \\ Q_{\mathbf{x}} & Q_{\mathbf{xx}} & Q_{\mathbf{xu}} \\ Q_{\mathbf{u}} & Q_{\mathbf{ux}} & Q_{\mathbf{uu}} \end{bmatrix} \begin{bmatrix}1 \\ \delta\mathbf{x} \\ \delta\mathbf{u}\end{bmatrix} \]

Update law (Newton-like):

\[ \hat{\mathbf{u}} = \mathbf{u} - \left(Q_{\mathbf{u}} - Q_{\mathbf{ux}}\delta\mathbf{x}\right)Q^{-1}_{\mathbf{uu}} \]

Inference solution

The KL control approach is grounded in the stochastic version of the Bellman equation.

\[ \begin{gathered} \underset{\mathbf{u}}{\min}\left[\ell\left(\mathbf{x}\right) + \mathrm{KL}(\mathcal{Q}\|\|\mathcal{P}) + \mathbb{E}_{\mathcal{Q}}\left[v\left(\mathbf{x}'\right)\right]\right]\\ \mathrm{KL}(\mathcal{Q}\|\|\mathcal{P}) = \mathbb{E}_{\mathcal{Q}}\!\left[\log\!\left(\frac{\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right)}{\mathbf{p}\left(\mathbf{x}'|\mathbf{x}\right)}\right)\right] \end{gathered} \]

Where \(\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right)\) and \(\mathbf{p}\left(\mathbf{x}'|\mathbf{x}\right)\) are the controlled and uncontrolled transition probabilities.

Optimal controls:

\[ \mathbf{p}^*\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right) = \mathbf{p}\left(\mathbf{x}'|\mathbf{x}\right)\exp\left(-v\left(\mathbf{x}'\right)\right)\eta^{-1} \]

Combining inference with second order method

Reformulate the stochastic Bellman objective and add a secondary divergence term to the tangent-space Bellman solution.

\[ \begin{gathered} \underset{\mathbf{u}}{\min}\left[l\left(\mathbf{x}\right) + (1-k)\,\mathrm{KL}(\mathcal{Q}\|\|\mathcal{P}) + k\,\mathrm{KL}(\mathcal{Q}\|\|\mathcal{C}) + \mathbb{E}_{\mathcal{Q}}\left[v\left(\mathbf{x}'\right)\right]\right]\\ \mathrm{KL}(\mathcal{Q}\|\|\mathcal{C}) = \mathbb{E}_{\mathcal{Q}}\!\left[\log\!\left(\frac{\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right)}{\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}_{\mathrm{iLQG}}\right)}\right)\right] \end{gathered} \]

Update law over trajectory distribution:

\[ \mathbf{q}^*\left(\mathbf{W}\right) = \mathbf{p}\left(\mathbf{W}\right)^{1-k}\mathbf{c}\left(\mathbf{W}\right)^{k}\exp\!\left(-\frac{1}{\lambda}S\left(\mathbf{W}\right)\right)\eta^{-1} \]
\[ \mathbf{p}\left(\mathbf{W}\right):\mathbf{w}_t\sim\mathcal{N}\left(0,\mathbf{\Sigma}\right),\quad \mathbf{q}\left(\mathbf{W}\right):\mathbf{w}_t\sim\mathcal{N}\left(\mathbf{u}_t,\mathbf{\Sigma}\right),\quad \mathbf{c}\left(\mathbf{W}\right):\mathbf{w}_t\sim\mathcal{N}\left(\mathbf{u}_{\mathrm{iLQG}_t},\mathbf{\Sigma}_{\mathrm{iLQG}}\right) \]

Importance sampling

We cannot directly sample from \(\mathbf{q}^*(\vec W)\), so we importance sample:

\[ \mathbf{u}^* = \int \mathbf{q}\left(\vec W\right)\,w\left(\vec W\right)\,v_t\,\mathrm{d}W \]

Sample weights:

\[ w\left(\vec W\right) = \frac{1}{\eta}\exp\!\left(-\frac{1}{2\lambda}S\left(\vec W\right) + \sum_{t=0}^{T}k\,\left(\vec w_t-\mathbf{u}_{\mathrm{iLQG}_t}\right)^\top \mathbf{\Sigma}_{\mathrm{iLQG}_t}^{-1}\left(\vec w_t-\mathbf{u}_{\mathrm{iLQG}_t}\right)\right) \]

iLQG distribution is obtained by maximizing \(\ln\exp\left(-\delta Q_t\left(\mathbf{x}_t,\mathbf{u}_t\right)\right)\), with samples from \(\mathcal{C}\) as \(\vec w_t\sim\mathcal{N}\left(\mathbf{u}_{\mathrm{iLQG}_t},Q^{-1}_{\mathbf{uu}_t}\right)\).

Results

Conclusion

Contribution

  • Natural combination of inference and numerical optimisation.
  • Inherent ability to explore.
  • Combine convex and non-convex costs.

Caveats

  • Search across hyperparameters: \(\lambda\), \(\beta\), and \(k\).
  • \(Q_{uu}^{-1}\) is not necessarily a good measure of confidence.
  • Constant control input noise affects convergence.
  • In the worst case, it can perform as badly as the other methods.