Reward Hacking: 2nd Grade Kickball


Misalignment is when the reward function you specify induces an optimal policy that you don’t like. At least that’s how I define it.

Some people call that reward hacking.

I define reward hacking as a specific type of misalignment, one that is currently easier to point at than to define.

Here’s the earliest brush with it that I can remember:

2nd Grade Kickball

This is basically baseball, but you get to kick things instead of hauling around some stupid bat. (My feelings about baseball have never really changed since that day.)

While waiting for my turn to kick the ball, I realized the PE teacher had said that we had to kick the ball, but hadn't said where we could kick it.

So I kicked it backwards, into the street. Almost hit a car, too.

We somehow cleared all 4 bases before the teacher got a chance to yell at me.

I followed the letter but not the spirit of the law. But I got 4 home runs for it, so what did I care?

After a bunch of arguing, a new rule was put into place to ban that.

So I kicked it onto the roof.
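
Here's a toy sketch of the whole saga in Python, using the definition from the top of the post: the reward is what the rules score, and the "optimal policy" is whatever kick scores highest under those rules. Everything in it (the action names, the run counts, the `banned` set, the teacher-approval flag) is made up for illustration; it's a cartoon of the story, not a real model of anything.

```python
# Hypothetical toy model: every name and number here is invented for illustration.
# Each action maps to (runs scored, whether the teacher approves).
ACTIONS = {
    "kick into the field": (1, True),
    "kick backwards into the street": (4, False),
    "kick onto the roof": (4, False),
}


def reward(action, banned):
    """Reward as the rules specify it: runs scored, or zero if the kick is banned."""
    runs, _approved = ACTIONS[action]
    return 0 if action in banned else runs


def optimal_policy(banned):
    """Pick the highest-reward action (ties go to whichever is listed first)."""
    return max(ACTIONS, key=lambda action: reward(action, banned))


# Original rules: nothing is banned.
first = optimal_policy(banned=set())

# Patched rules: the one exploit the teacher saw gets banned.
second = optimal_policy(banned={"kick backwards into the street"})

for label, action in [("original rules", first), ("patched rules", second)]:
    approved = ACTIONS[action][1]
    print(f"{label}: {action} (teacher approves: {approved})")
```

Banning one action only moves the optimum to the next unbanned action with the same payoff. The teacher's approval never enters `reward`, so the policy has no reason to care about it.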
