Stupid Simple Algorithms: DAgger

My hatred of the fetish for acronyms in academia reminds me of Amberley Vail’s rants against Imperial Guard jargon in the Ciaphas Cain novels.

They could have just called it Dataset Aggregation. It’s not that long of a phrase and it has nothing to do with the weapon anyway.

I’m assuming you’re familiar with the point of imitation learning.

Our expert policy is \(\pi^{*}\), our learner is \(\pi\), and the aggregated dataset of observation–action pairs \((x, a)\) is \(\mathbb{D}\).

DAgger is an algorithm which boils down to this simple idea: Try things for yourself, but use your teacher to criticize your actions as long as you have their expertise available.

That’s basically it. Here’s a sketch of the whole process, assuming you have some training data to start with.

  1. Train \(\pi\) with \(\mathbb{D}\).
  2. Execute \(\pi\) to get observations and actions \((x, a)\).
  3. Use \(\pi^{*}\) to tell you what you should have done. That is, replace \((x, a)\) with \((x, a')\) where \(a' := \pi^{*}(x)\).
  4. Train \(\pi\) with this modified data.
  5. Repeat till \(\pi\) is good enough or your funding runs out or something.
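
The five steps above can be sketched in a few lines of Python. Everything here is a toy stand-in of my own invention (a 1-D state, a least-squares "learner", an expert that steers toward zero), not anything from the original paper; the point is just the loop structure: the learner's action drives the trajectory, the expert's action gets recorded.

```python
import numpy as np

class LinearLearner:
    """Toy learner pi: fits a = w * x by least squares."""
    def __init__(self):
        self.w = 0.0
    def fit(self, X, A):
        X, A = np.asarray(X), np.asarray(A)
        self.w = float(X @ A / (X @ X + 1e-8))
    def predict(self, x):
        return self.w * x

def expert_policy(x):
    # Expert pi*: steer the state toward 0.
    return -x

def step(x, a):
    # Trivial dynamics: the action is added to the state.
    return x + a

def dagger(learner, n_iters=5, episode_len=20, seed=0):
    rng = np.random.default_rng(seed)
    # Seed the dataset D with one expert-labeled point.
    X, A = [1.0], [expert_policy(1.0)]
    for _ in range(n_iters):
        learner.fit(X, A)                  # 1. train pi on D
        x = rng.normal()
        for _ in range(episode_len):       # 2. execute pi
            a = learner.predict(x)
            X.append(x)                    # 3. record (x, a') with a' = pi*(x),
            A.append(expert_policy(x))     #    not the action pi actually took
            x = step(x, a)                 # pi's action drives the trajectory
    learner.fit(X, A)                      # 4. train pi on the aggregated data
    return learner                         # 5. (funding ran out)
```

Note that steps 3 and 4 never throw data away: \(\mathbb{D}\) only grows, which is the "aggregation" in the name.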

The key thing is that the actions actually taken, \(a\), are picked by the learner, but the expert/teacher tells you what you should have done, with \(a' := \pi^{*}(x)\).

This avoids the issue of the expert and learner policies inducing different trajectories over the state space by aggregating the data they generate. Rather than trying to make the policies exactly the same, you make their data the same (in the limit).

It even works if you sometimes let any other policy (usually the expert) pick the action. It works best if that other policy is better than your own. This is well suited to something like autonomous driving, where you don’t want to hit a tree and you’d still like your training to work.
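
This "sometimes let the expert pick the action" trick is the \(\beta\)-mixture from the original DAgger paper: at each step the executed action comes from the expert with probability \(\beta\) and from the learner otherwise, while the expert's label still goes into the dataset either way. A minimal sketch (the function name and interfaces are my own):

```python
import random

def mixed_action(x, learner, expert_policy, beta=0.5):
    """Pick the executed action from a beta-mixture of expert and learner.

    With probability beta the (presumably safer) expert drives;
    otherwise the learner does. The dataset label is pi*(x) regardless.
    """
    if random.random() < beta:
        return expert_policy(x)
    return learner.predict(x)
```

Starting with \(\beta\) near 1 and decaying it toward 0 is the usual schedule: the expert keeps the car out of the trees early on, and the learner takes over as it improves.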
