Optimal Control via Combined Inference and Numerical Optimization

Problem statement

Numerical optimisation is efficient at refinement but fails to discover control policies or behaviours far from the point of initialisation. Sampling can discover further local optima through exploration with simpler cost functions.

Aim. Combine inference and second-order trajectory optimization so the method can efficiently sample when no derivative information is available and follow optimized trajectories when derivatives arise.

Numerical solution

The method relies on the Bellman equation and its tangent-space solution.

\[ v\left(\mathbf{x}, t\right) = \min_{\mathbf{u}}\left[\ell\left(\mathbf{x},\mathbf{u},t\right) + v\left(\mathbf{f}\left(\mathbf{x},\mathbf{u},t\right)\right)\right] \equiv Q\left(\mathbf{x},\mathbf{u},t\right) \]

Solving in tangent space:

\[ Q\left(\delta\mathbf{x},\delta\mathbf{u}\right) \approx \frac{1}{2} \begin{bmatrix}1 \\ \delta\mathbf{x} \\ \delta\mathbf{u}\end{bmatrix}^{\!\top} \begin{bmatrix} 0 & Q_{\mathbf{x}}^{\top} & Q_{\mathbf{u}}^{\top} \\ Q_{\mathbf{x}} & Q_{\mathbf{xx}} & Q_{\mathbf{xu}} \\ Q_{\mathbf{u}} & Q_{\mathbf{ux}} & Q_{\mathbf{uu}} \end{bmatrix} \begin{bmatrix}1 \\ \delta\mathbf{x} \\ \delta\mathbf{u}\end{bmatrix} \]

Update law (Newton-like):

\[ \hat{\mathbf{u}} = \mathbf{u} - \left(Q_{\mathbf{u}} - Q_{\mathbf{ux}}\delta\mathbf{x}\right)Q^{-1}_{\mathbf{uu}} \]

Inference solution

The KL control approach is grounded in the stochastic version of the Bellman equation.

\[ \begin{gathered} \underset{\mathbf{u}}{\min}\left[\ell\left(\mathbf{x}\right) + \mathrm{KL}(\mathcal{Q}\|\|\mathcal{P}) + \mathbb{E}_{\mathcal{Q}}\left[v\left(\mathbf{x}'\right)\right]\right]\\ \mathrm{KL}(\mathcal{Q}\|\|\mathcal{P}) = \mathbb{E}_{\mathcal{Q}}\!\left[\log\!\left(\frac{\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right)}{\mathbf{p}\left(\mathbf{x}'|\mathbf{x}\right)}\right)\right] \end{gathered} \]

Where \(\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right)\) and \(\mathbf{p}\left(\mathbf{x}'|\mathbf{x}\right)\) are the controlled and uncontrolled transition probabilities.

Optimal controls:

\[ \mathbf{p}^*\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right) = \mathbf{p}\left(\mathbf{x}'|\mathbf{x}\right)\exp\left(-v\left(\mathbf{x}'\right)\right)\eta^{-1} \]

Combining inference with second order method

Reformulate the stochastic Bellman objective and add a secondary divergence term to the tangent-space Bellman solution.

\[ \begin{gathered} \underset{\mathbf{u}}{\min}\left[l\left(\mathbf{x}\right) + (1-k)\,\mathrm{KL}(\mathcal{Q}\|\|\mathcal{P}) + k\,\mathrm{KL}(\mathcal{Q}\|\|\mathcal{C}) + \mathbb{E}_{\mathcal{Q}}\left[v\left(\mathbf{x}'\right)\right]\right]\\ \mathrm{KL}(\mathcal{Q}\|\|\mathcal{C}) = \mathbb{E}_{\mathcal{Q}}\!\left[\log\!\left(\frac{\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}\right)}{\mathbf{p}\left(\mathbf{x}'|\mathbf{x},\mathbf{u}_{\mathrm{iLQG}}\right)}\right)\right] \end{gathered} \]

Update law over trajectory distribution:

\[ \mathbf{q}^*\left(\mathbf{W}\right) = \mathbf{p}\left(\mathbf{W}\right)^{1-k}\mathbf{c}\left(\mathbf{W}\right)^{k}\exp\!\left(-\frac{1}{\lambda}S\left(\mathbf{W}\right)\right)\eta^{-1} \]

\[ \mathbf{p}\left(\mathbf{W}\right):\mathbf{w}_t\sim\mathcal{N}\left(0,\mathbf{\Sigma}\right),\quad \mathbf{q}\left(\mathbf{W}\right):\mathbf{w}_t\sim\mathcal{N}\left(\mathbf{u}_t,\mathbf{\Sigma}\right),\quad \mathbf{c}\left(\mathbf{W}\right):\mathbf{w}_t\sim\mathcal{N}\left(\mathbf{u}_{\mathrm{iLQG}_t},\mathbf{\Sigma}_{\mathrm{iLQG}}\right) \]

Importance sampling

We cannot directly sample from \(\mathbf{q}^*(\vec W)\), so we importance sample:

\[ \mathbf{u}^* = \int \mathbf{q}\left(\vec W\right)\,w\left(\vec W\right)\,v_t\,\mathrm{d}W \]

Sample weights:

\[ w\left(\vec W\right) = \frac{1}{\eta}\exp\!\left(-\frac{1}{2\lambda}S\left(\vec W\right) + \sum_{t=0}^{T}k\,\left(\vec w_t-\mathbf{u}_{\mathrm{iLQG}_t}\right)^\top \mathbf{\Sigma}_{\mathrm{iLQG}_t}^{-1}\left(\vec w_t-\mathbf{u}_{\mathrm{iLQG}_t}\right)\right) \]

iLQG distribution is obtained by maximizing \(\ln\exp\left(-\delta Q_t\left(\mathbf{x}_t,\mathbf{u}_t\right)\right)\), with samples from \(\mathcal{C}\) as \(\vec w_t\sim\mathcal{N}\left(\mathbf{u}_{\mathrm{iLQG}_t},Q^{-1}_{\mathbf{uu}_t}\right)\).

Optimal Control via Combined Inference and Numerical Optimization

Problem statement

Numerical solution

Inference solution

Combining inference with second order method

Importance sampling

Results

Conclusion

Contribution

Caveats